Posted to hdfs-user@hadoop.apache.org by Lin Ma <li...@gmail.com> on 2012/10/19 12:50:30 UTC

Hadoop counter

Hi guys,

I have some quick questions regarding Hadoop counters:


   - Is a Hadoop counter (custom-defined) globally accessible (for both read
   and write) by all Mappers and Reducers in a job?
   - What are the performance implications and best practices of using Hadoop
   counters? I am not sure whether using Hadoop counters too heavily will
   degrade the performance of the whole job.

regards,
Lin
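
A minimal sketch of a user-defined counter, for the first question (the ParsingMapper class and the Quality enum are illustrative, not from the thread): each map or reduce task increments the counter through its own Context, and the framework aggregates the per-task values; as the discussion below concludes, other running tasks cannot read the aggregate mid-job.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // A custom (user-defined) counter; Hadoop groups counters by this enum's class name.
    public enum Quality { BAD_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // The increment is local to this task; the JT aggregates across tasks.
            context.getCounter(Quality.BAD_RECORDS).increment(1);
            return;
        }
        context.write(new Text(fields[0]), new LongWritable(1));
    }
}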

RE: Java heap space error

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Subash,

I was experiencing this type of error at some point, and no matter how much I played with the heap size it didn't help. What I found out in the end was that I was running out of physical disk space. My output file was about 4GB with only 2.5GB of free space available. Check your space with "$ hadoop fs -df".
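
A minimal sketch of the same free-space check from Java, using FileSystem.getStatus() (the PrintDfsFree class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class PrintDfsFree {
    public static void main(String[] args) throws Exception {
        // Picks up the default filesystem from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus();
        System.out.println("capacity bytes  = " + status.getCapacity());
        System.out.println("used bytes      = " + status.getUsed());
        System.out.println("remaining bytes = " + status.getRemaining());
    }
}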


From: Subash D'Souza [mailto:sdsouza@truecar.com]
Sent: Sunday, October 21, 2012 9:19 AM
To: user@hadoop.apache.org
Subject: Java heap space error

I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the name node log file and I had to reformat it and put the data back.

Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.

I checked the log files, and the only thing they output is a Java heap space error. Nothing more.

Any help would be appreciated.

Thanks
Subash


Re: Java heap space error

Posted by Michael Segel <mi...@hotmail.com>.
Try upping the child to 1.5GB or more.

On Oct 21, 2012, at 8:18 AM, Subash D'Souza <sd...@truecar.com> wrote:

> I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the name node log file and I had to reformat it and put the data back.
> 
> Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.
> 
> I checked the log files, and the only thing they output is a Java heap space error. Nothing more.
> 
> Any help would be appreciated.
> 
> Thanks
> Subash
> 
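
A minimal driver-side sketch of this suggestion (the SetChildHeap class and job name are illustrative; the same key can instead be set in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SetChildHeap {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give each child task JVM a 1.5 GB heap, per the suggestion above.
        conf.set("mapred.child.java.opts", "-Xmx1536m");
        Job job = Job.getInstance(conf, "heap-test");
        System.out.println(job.getConfiguration().get("mapred.child.java.opts"));
    }
}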


Re: Java heap space error

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Did this job ever run successfully for you? With 200m heap size?

Seems like your maps are failing. Can you paste your settings for the following:
 - io.sort.factor
 - io.sort.mb
 - mapreduce.map.sort.spill.percent

Thanks,
+Vinod

On Oct 21, 2012, at 6:18 AM, Subash D'Souza wrote:

> I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the name node log file and I had to reformat it and put the data back.
> 
> Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.
> 
> I checked the log files, and the only thing they output is a Java heap space error. Nothing more.
> 
> Any help would be appreciated.
> 
> Thanks
> Subash
> 
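
A minimal sketch for collecting the settings Vinod asks about (the defaults shown are the stock values of that era; verify against your own *-site.xml):

import org.apache.hadoop.conf.Configuration;

public class PrintSortSettings {
    public static void main(String[] args) {
        // Loads core-site.xml / mapred-site.xml from the classpath.
        Configuration conf = new Configuration();
        System.out.println("io.sort.factor = " + conf.get("io.sort.factor", "10"));
        System.out.println("io.sort.mb = " + conf.get("io.sort.mb", "100"));
        System.out.println("mapreduce.map.sort.spill.percent = "
                + conf.get("mapreduce.map.sort.spill.percent", "0.80"));
    }
}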



Re: Java heap space error

Posted by Subash D'Souza <sd...@truecar.com>.
Here are the mapred-site and yarn-site configs.
mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hadoop1.rad.wc.truecarcorp.com:8021</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:19888</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx800m</value>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx800m</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx800m</value>
</property>
</configuration>



yarn-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:8088</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>

<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/data/1/yarn/local,/home/data/2/yarn/local,/home/data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/data/1/yarn/logs,/home/data/2/yarn/logs,/home/data/3/yarn/logs</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop1.rad.wc.truecarcorp.com:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/home/data/tmp</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>84000</value>
</property>


</configuration>






On 10/21/12 7:22 AM, "Marcos Ortiz Valmaseda" <ml...@uci.cu> wrote:

>Regards, Subash.
>Can you share more information about your YARN cluster?
>
>----- Original Message -----
>From: Subash D'Souza <sd...@truecar.com>
>To: user@hadoop.apache.org
>Sent: Sun, 21 Oct 2012 09:18:43 -0400 (CDT)
>Subject: Java heap space error
>
>I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up
>until last week the cluster was running fine, until there was an error in
>the name node log file and I had to reformat it and put the data back.
>
>Now when I run Hive on YARN, I keep getting a Java heap space error.
>Based on the research I did, I upped my mapred.child.java.opts first from
>200m to 400m to 800m, and I still have the same issue. It seems to fail
>near the 100% mapper mark.
>
>I checked the log files, and the only thing they output is a Java heap
>space error. Nothing more.
>
>Any help would be appreciated.
>
>Thanks
>Subash
>




Re: Java heap space error

Posted by Marcos Ortiz Valmaseda <ml...@uci.cu>.
Regards, Subash.
Can you share more information about your YARN cluster?

----- Original Message -----
From: Subash D'Souza <sd...@truecar.com>
To: user@hadoop.apache.org
Sent: Sun, 21 Oct 2012 09:18:43 -0400 (CDT)
Subject: Java heap space error

I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the name node log file and I had to reformat it and put the data back.

Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.

I checked the log files, and the only thing they output is a Java heap space error. Nothing more.

Any help would be appreciated.

Thanks
Subash





Java heap space error

Posted by Subash D'Souza <sd...@truecar.com>.
I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the name node log file and I had to reformat it and put the data back.

Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.

I checked the log files, and the only thing they output is a Java heap space error. Nothing more.

Any help would be appreciated.

Thanks
Subash


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks for the long discussion, Mike. I learned a lot from you.

regards,
Lin

On Tue, Oct 23, 2012 at 11:57 AM, Michael Segel
<mi...@hotmail.com> wrote:

> Yup.
> The counters at the end of the job are the most accurate.
>
> On Oct 22, 2012, at 3:00 AM, Lin Ma <li...@gmail.com> wrote:
>
> Thanks so much for the help, Mike. I learned a lot from this discussion.
>
> So, the conclusion I take from the discussion is: since how/when the JT
> merges counters in the middle of a job is undefined internal behavior, it
> is more reliable to read counters after the whole job completes? Agree?
>
> regards,
> Lin
>
On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>>
>> On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Thanks for the detailed reply, Mike. Yes, most of my confusion is
>> resolved. The last two questions (or comments) are to confirm that my
>> understanding is correct:
>>
>> - Is it a normal use case or best practice for a job to consume/read the
>> counters from a previously completed job in an automatic way? I ask this
>> because I am not sure whether the main use case of counters is human
>> reading and manual analysis, rather than having another job automatically
>> consume the counters.
>>
>>
>> Lin,
>> Every job has a set of counters to maintain job statistics.
>> This is specifically for human analysis and to help understand what
>> happened with your job.
>> It allows you to see how much data is read in by the job and how many
>> records were processed, measured against how long the job took to
>> complete. It also shows you how much data is written back out.
>>
>> In addition to this, a set of use cases for counters in Hadoop center on
>> quality control. It's normal to chain jobs together to form a job flow.
>> A typical use case for Hadoop is to pull data from various sources,
>> combine them and do some process on them, resulting in a data set that gets
>> sent to another system for visualization.
>>
>> In this use case, there are usually data cleansing and validation jobs.
>> As they run, its possible to track a number of defective records. At the
>> end of that specific job, from the ToolRunner, or whichever job class you
>> used to launch your job, you can then get these aggregated counters for the
>> job and determine if the process passed or failed.  Based on this, you can
>> exit your program with either a success or failed flag.  Job Flow control
>> tools like Oozie can capture this and then decide to continue or to stop
>> and alert an operator of an error.
>>
>> - I want to confirm that my understanding is correct: when each task
>> completes, the JT will aggregate/update the global counter values from
>> the counter values reported by the completed task, but never exposes the
>> global counter values until the job completes? If that is correct, I am
>> wondering why the JT does the aggregation each time a task completes,
>> rather than doing a one-time aggregation when the job completes? Is there
>> any design choice reason? thanks.
>>
>>
>> That's a good question. I haven't looked at the code, so I can't say
>> definitively when the JT performs its aggregation. However, while the job
>> is running, we can look at the job tracker web page(s) and see the
>> counter summary. This would imply that there has to be some aggregation
>> occurring mid-flight. (It would be trivial to sum the list of counters
>> periodically to update the job statistics.) Note too that if the JT web
>> pages can show a counter, it's possible to write a monitoring tool that
>> watches the job while it runs and then kills the job mid-flight if a
>> certain threshold of a counter is met.
>>
>> That is to say, you could in theory write a monitoring process and watch
>> the counters. If, let's say, an error counter hits a predetermined
>> threshold, you could then issue a 'hadoop job -kill <job-id>' command.
>>
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>>
>>> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>>>
>>> - I just want to confirm with you that, supposing in the same job, when
>>> a specific task has completed (and its counters are aggregated in the JT
>>> after the task completes, from our discussion?), the other running tasks
>>> in the same job cannot get the updated counter value from the previously
>>> completed task? I am asking this because I am thinking about whether I
>>> can use a counter to share a global value between tasks.
>>>
>>>
>>> Yes that is correct.
>>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy
>>> way for a task to query the job tracker. This might have changed in YARN.
>>>
>>> - If so, what is the traditional use case of counter, only use counter
>>> values after the whole job completes?
>>>
>>> Yes the counters are used to provide data at the end of the job...
>>>
>>> BTW: I would appreciate it if you could share a few use cases from your
>>> experience about how counters are used.
>>>
>>> Well you have your typical job data like the number of records
>>> processed, total number of bytes read,  bytes written...
>>>
>>> But suppose you wanted to do some quality control on your input.
>>> So you need to keep track of the count of bad records. If this job is
>>> part of a process, you may want to include business logic in your job to
>>> halt the job flow if X% of the records contain bad data.
>>>
>>> Or your process takes input records and in processing them, they sort
>>> the records based on some characteristic and you want to count those sorted
>>> records as you processed them.
>>>
>>> For a more concrete example, the Illinois Tollway has these 'fast pass'
>>> lanes where cars equipped with RFID tags can have the tolls automatically
>>> deducted from their accounts rather than pay the toll manually each time.
>>>
>>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes
>>> are cheaters where they drive through the sensor and the sensor doesn't
>>> capture the RFID tag. (Note its possible that you have a false positive
>>> where the car has an RFID chip but doesn't trip the sensor.) Pushing the
>>> data in a map/reduce job would require the use of counters.
>>>
>>> Does that help?
>>>
>>> -Mike
>>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> Yeah, sorry...
>>>>
>>>> I meant that if you were dynamically creating a counter foo in the
>>>> Mapper task, then each mapper would be creating their own counter foo.
>>>> As the job runs, these counters will eventually be sent up to the JT.
>>>> The job tracker would keep a separate counter for each task.
>>>>
>>>> At the end, the final count is aggregated from the list of counters for
>>>> foo.
>>>>
>>>>
>>>> I don't know how you can get a task to ask information from the Job
>>>> Tracker on how things are going in other tasks.  That is what I meant that
>>>> you couldn't get information about the other counters or even the status of
>>>> the other tasks running in the same job.
>>>>
>>>> I didn't see anything in the APIs that allowed for that type of flow...
>>>> Of course having said that... someone pops up with a way to do just that.
>>>> ;-)
>>>>
>>>>
>>>> Does that clarify things?
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>>>
>>>> From this statement of yours, "It would make sense that the JT maintains
>>>> a unique counter for each task until the tasks complete" -- it seems
>>>> tasks cannot see each other's counters, since the JT maintains a unique
>>>> counter for each task;
>>>>
>>>> From this comment of yours, "I meant that if a Task created and updated
>>>> a counter, a different Task has access to that counter" -- it seems
>>>> different tasks could share/access the same counter.
>>>>
>>>> Appreciate if you could help to clarify a bit.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>>
>>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>>
>>>>> 1. For "task", you mean a specific mapper instance, or a specific
>>>>> reducer instance?
>>>>>
>>>>>
>>>>> Either.
>>>>>
>>>>> 2. "However, I do not believe that a separate Task could connect with
>>>>> the JT and see if the counter exists or if it could get a value or even an
>>>>> accurate value since the updates are asynchronous." -- do you mean if a
>>>>> mapper is updating custom counter ABC, and another mapper is updating the
>>>>> same custom counter ABC, their counter values are updated independently
>>>>> by different mappers, and will not published (aggregated) externally until
>>>>> job completed successfully?
>>>>>
>>>>> I meant that if a Task created and updated a counter, a different Task
>>>>> has access to that counter.
>>>>>
>>>>> To give you an example, if I want to count the number of quality
>>>>> errors and then fail after X number of errors, I can't use Global counters
>>>>> to do this.
>>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>>>> michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> As I understand it... each Task has its own counters, which are
>>>>>> independently updated. As the tasks report back to the JT, they update
>>>>>> the counters' status.
>>>>>> The JT then will aggregate them.
>>>>>>
>>>>>> In terms of performance, Counters take up some memory in the JT so
>>>>>> while its OK to use them, if you abuse them, you can run in to issues.
>>>>>> As to limits... I guess that will depend on the amount of memory on
>>>>>> the JT machine, the size of the cluster (Number of TT) and the number of
>>>>>> counters.
>>>>>>
>>>>>> In terms of global accessibility... Maybe.
>>>>>>
>>>>>> The reason I say maybe is that I'm not sure by what you mean by
>>>>>> globally accessible.
>>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>>>> separate Task could connect with the JT and see if the counter exists or if
>>>>>> it could get a value or even an accurate value since the updates are
>>>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>>>> aggregated until the job ends. It would make sense that the JT maintains a
>>>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>>>> would have to delete the counters so that when the task is restarted the
>>>>>> correct count is maintained. )  Note, I haven't looked at the source code
>>>>>> so I am probably wrong.
>>>>>>
>>>>>> HTH
>>>>>> Mike
>>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have some quick questions regarding Hadoop counters:
>>>>>>
>>>>>>
>>>>>>    - Is a Hadoop counter (custom-defined) globally accessible (for
>>>>>>    both read and write) by all Mappers and Reducers in a job?
>>>>>>    - What are the performance implications and best practices of using
>>>>>>    Hadoop counters? I am not sure whether using Hadoop counters too
>>>>>>    heavily will degrade the performance of the whole job.
>>>>>>
>>>>>> regards,
>>>>>> Lin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
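
A minimal sketch of the quality-control pattern described above: wait for the job to finish, read the aggregated counters, and exit with a success or failure flag that a flow tool like Oozie can branch on. The QualityGate class, the Quality enum, and the 1% threshold are illustrative, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QualityGate {
    public enum Quality { BAD_RECORDS, TOTAL_RECORDS }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "qc-gated-job");
        job.setJarByClass(QualityGate.class);
        // Mapper/reducer classes elided; assume they increment the Quality counters.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean completed = job.waitForCompletion(true);

        // Counters are only final once the job has completed.
        long bad = job.getCounters().findCounter(Quality.BAD_RECORDS).getValue();
        long total = job.getCounters().findCounter(Quality.TOTAL_RECORDS).getValue();

        boolean passed = completed && (total == 0 || bad * 100 < total); // under 1% bad
        System.exit(passed ? 0 : 1); // Oozie etc. can stop the flow on a non-zero exit
    }
}

The mid-flight monitor Mike mentions could, in the old mapred API, poll JobClient.getJob(<job-id>).getCounters() on a schedule and call killJob() when a threshold is crossed, with the caveat from the thread that in-flight counter values are not guaranteed to be accurate.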

Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks for the long discussion Mile. Learned a lot from you.

regards,
Lin

On Tue, Oct 23, 2012 at 11:57 AM, Michael Segel
<mi...@hotmail.com>wrote:

> Yup.
> The counters at the end of the job are the most accurate.
>
> On Oct 22, 2012, at 3:00 AM, Lin Ma <li...@gmail.com> wrote:
>
> Thanks for the help so much, Mike. I learned a lot from this discussion.
>
> So, the conclusion I learned from the discussion should be, since how/when
> JT merge counter in the middle of the process of a job is undefined and
> internal behavior, it is more reliable to read counter after the whole job
> completes? Agree?
>
> regards,
> Lin
>
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>>
>> On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Thanks for the detailed reply, Mike. Yes, my most confusion is resolved
>> by you. The last two questions (or comments) are used to confirm my
>> understanding is correct,
>>
>> - is it normal use case or best practices for a job to consume/read the
>> counters from previous completed job in an automatic way? I ask this
>> because I am not sure whether the most use case of counter is human read
>> and manual analysis, other then using another job to automatic consume the
>> counters?
>>
>>
>> Lin,
>> Every job has a set of counters to maintain job statistics.
>> This is specifically for human analysis and to help understand what
>> happened with your job.
>> It allows you to see how much data is read in by the job, how many
>> records processed to be measured against how long the job took to complete.
>>  It also showed you how much data is written back out.
>>
>> In addition to this,  a set of use cases for counters in Hadoop center on
>> quality control. Its normal to chain jobs together to form a job flow.
>> A typical use case for Hadoop is to pull data from various sources,
>> combine them and do some process on them, resulting in a data set that gets
>> sent to another system for visualization.
>>
>> In this use case, there are usually data cleansing and validation jobs.
>> As they run, its possible to track a number of defective records. At the
>> end of that specific job, from the ToolRunner, or whichever job class you
>> used to launch your job, you can then get these aggregated counters for the
>> job and determine if the process passed or failed.  Based on this, you can
>> exit your program with either a success or failed flag.  Job Flow control
>> tools like Oozie can capture this and then decide to continue or to stop
>> and alert an operator of an error.
>>
>> - I want to confirm my understanding is correct, when each task
>> completes, JT will aggregate/update the global counter values from the
>> specific counter values updated by the complete task, but never expose
>> global counters values until job completes? If it is correct, I am
>> wondering why JT doing aggregation each time when a task completes, other
>> than doing a one time aggregation when the job completes? Is there any
>> design choice reasons? thanks.
>>
>>
>> That's a good question. I haven't looked at the code, so I can't say
>> definitively when the JT performs its aggregation. However, as the job runs
>> and in process, we can look at the job tracker web page(s) and see the
>> counter summary. This would imply that there has to be some aggregation
>> occurring mid-flight. (It would be trivial to sum the list of counters
>> periodically to update the job statistics.)  Note too that if the JT web
>> pages can show a counter, its possible to then write a monitoring tool that
>> can monitor the job while running and then kill the job mid flight if a
>> certain threshold of a counter is met.
>>
>> That is to say, you could in theory write a monitoring process and watch
>> the counters. If, let's say, an error counter hits a predetermined
>> threshold, you could then issue a 'hadoop job -kill <job-id>' command.
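
(A rough sketch of such a watchdog, using the old org.apache.hadoop.mapred
client API. The job id argument, the "Quality"/"BAD_RECORDS" counter, and the
polling interval are illustrative assumptions; note that mid-flight values can
lag, since task updates reach the JT asynchronously.)

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterWatchdog {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0])); // e.g. job_201210220301_0007
        long threshold = Long.parseLong(args[1]);

        while (!job.isComplete()) {
          long errors = job.getCounters()
                           .findCounter("Quality", "BAD_RECORDS").getCounter();
          if (errors > threshold) {
            job.killJob(); // same effect as 'hadoop job -kill <job-id>'
            break;
          }
          Thread.sleep(30 * 1000L); // poll every 30 seconds
        }
      }
    }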
>>
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>>
>>> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Thanks for the detailed reply, Mike. I learned a lot from the discussion.
>>>
>>> - I just want to confirm with you that, within the same job, when a
>>> specific task has completed (and its counters are aggregated in the JT
>>> after the task completes, per our discussion?), the other running tasks in
>>> the same job cannot get the updated counter value from the previously
>>> completed task? I am asking because I am wondering whether I can use a
>>> counter to share a global value between tasks.
>>>
>>>
>>> Yes, that is correct.
>>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way
>>> for a task to query the job tracker. This might have changed in YARN.
>>>
>>> - If so, what is the traditional use case for counters: only using the
>>> counter values after the whole job completes?
>>>
>>> Yes, the counters are used to provide data at the end of the job...
>>>
>>> BTW: I'd appreciate it if you could share a few use cases from your
>>> experience of how counters are used.
>>>
>>> Well, you have your typical job data like the number of records processed,
>>> the total number of bytes read, bytes written...
>>>
>>> But suppose you wanted to do some quality control on your input.
>>> So you need to keep track of the count of bad records. If this job is part
>>> of a process, you may want to include business logic in your job to halt
>>> the job flow if X% of the records contain bad data.
>>>
>>> Or your process takes input records and, in processing them, sorts the
>>> records based on some characteristic, and you want to count those sorted
>>> records as you process them.
>>>
>>> For a more concrete example, the Illinois Tollway has these 'fast pass'
>>> lanes where cars equipped with RFID tags can have the tolls automatically
>>> deducted from their accounts rather than pay the toll manually each time.
>>>
>>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
>>> cheaters: they drive through the sensor and the sensor doesn't capture an
>>> RFID tag. (Note it's possible to have a false positive, where the car has
>>> an RFID chip but doesn't trip the sensor.) Pushing that data through a
>>> map/reduce job would require the use of counters.
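
(A sketch of the mapper side of that count. The CSV record layout and the
"Tollway" counter names are invented for illustration; context.getCounter()
is the standard mapreduce-API call for incrementing a dynamic counter.)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TollReadMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Assume a record of: timestamp,lane,rfidTag (rfidTag empty if not read).
        String[] fields = value.toString().split(",", -1);
        if (fields.length < 3 || fields[2].isEmpty()) {
          // Dynamic counter: created on first increment, aggregated by the JT.
          context.getCounter("Tollway", "NO_TAG_READ").increment(1);
        } else {
          context.getCounter("Tollway", "TAG_READ").increment(1);
        }
      }
    }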
>>>
>>> Does that help?
>>>
>>> -Mike
>>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>
>>>> Yeah, sorry...
>>>>
>>>> I meant that if you were dynamically creating a counter foo in the
>>>> Mapper task, then each mapper would be creating its own counter foo.
>>>> As the job runs, these counters will eventually be sent up to the JT.
>>>> The job tracker would keep a separate counter for each task.
>>>>
>>>> At the end, the final count is aggregated from the list of counters for
>>>> foo.
>>>>
>>>>
>>>> I don't know how you could get a task to request information from the Job
>>>> Tracker on how things are going in other tasks. That is what I meant: you
>>>> couldn't get information about the other counters, or even the status of
>>>> the other tasks running in the same job.
>>>>
>>>> I didn't see anything in the APIs that allowed for that type of flow...
>>>> Of course, having said that... someone will pop up with a way to do just
>>>> that. ;-)
>>>>
>>>>
>>>> Does that clarify things?
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Sorry, I am a bit lost... as you are thinking faster than me. :-P
>>>>
>>>> From this statement of yours -- "It would make sense that the JT
>>>> maintains a unique counter for each task until the tasks complete." -- it
>>>> seems tasks cannot see each other's counters, since the JT maintains a
>>>> unique counter for each task;
>>>>
>>>> From this comment of yours -- "I meant that if a Task created and updated
>>>> a counter, a different Task has access to that counter." -- it seems
>>>> different tasks could share/access the same counter.
>>>>
>>>> I'd appreciate it if you could help clarify a bit.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>
>>>>>
>>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>>
>>>>> 1. For "task", you mean a specific mapper instance, or a specific
>>>>> reducer instance?
>>>>>
>>>>>
>>>>> Either.
>>>>>
>>>>> 2. "However, I do not believe that a separate Task could connect with
>>>>> the JT and see if the counter exists or if it could get a value or even an
>>>>> accurate value since the updates are asynchronous." -- do you mean if a
>>>>> mapper is updating custom counter ABC, and another mapper is updating the
>>>>> same customer counter ABC, their counter values are updated independently
>>>>> by different mappers, and will not published (aggregated) externally until
>>>>> job completed successfully?
>>>>>
>>>>> I meant that if a Task created and updated a counter, a different Task
>>>>> has access to that counter.
>>>>>
>>>>> To give you an example, if I want to count the number of quality
>>>>> errors and then fail after X number of errors, I can't use Global counters
>>>>> to do this.
>>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> As I understand it... each Task has its own counters, which are
>>>>>> independently updated. As the tasks report back to the JT, they update
>>>>>> the counter(s)' status.
>>>>>> The JT will then aggregate them.
>>>>>>
>>>>>> In terms of performance, counters take up some memory in the JT, so
>>>>>> while it's OK to use them, if you abuse them you can run into issues.
>>>>>> As to limits... I guess that will depend on the amount of memory on the
>>>>>> JT machine, the size of the cluster (number of TTs), and the number of
>>>>>> counters.
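
(For what it's worth on those limits: 1.x-era releases also cap the number of
counters per job -- 120 by default, via a property I recall as
mapreduce.job.counters.limit, renamed mapreduce.job.counters.max in later
versions; verify the name against your release. Raising it is a one-line
config change, e.g.:)

    import org.apache.hadoop.conf.Configuration;

    public class CounterLimitExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 1.x-era property name; check your release before relying on it.
        conf.setInt("mapreduce.job.counters.limit", 240);
      }
    }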
>>>>>>
>>>>>> In terms of global accessibility... Maybe.
>>>>>>
>>>>>> The reason I say maybe is that I'm not sure what you mean by globally
>>>>>> accessible.
>>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>>>> separate Task could connect with the JT and see if the counter exists,
>>>>>> or get a value, or even an accurate value, since the updates are
>>>>>> asynchronous. Not to mention that I don't believe that the counters are
>>>>>> aggregated until the job ends. It would make sense that the JT maintains
>>>>>> a unique counter for each task until the tasks complete. (If a task
>>>>>> fails, it would have to delete its counters so that when the task is
>>>>>> restarted the correct count is maintained.) Note, I haven't looked at
>>>>>> the source code, so I am probably wrong.
>>>>>>
>>>>>> HTH
>>>>>> Mike
>>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have some quick questions regarding Hadoop counters,
>>>>>>
>>>>>>
>>>>>>    - Is a Hadoop counter (custom defined) globally accessible (for
>>>>>>    both read and write) for all Mappers and Reducers in a job?
>>>>>>    - What are the performance implications and best practices of using
>>>>>>    Hadoop counters? I am not sure whether using Hadoop counters too
>>>>>>    heavily will degrade the performance of the whole job?
>>>>>>
>>>>>> regards,
>>>>>> Lin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
Yup. 
The counters at the end of the job are the most accurate. 

On Oct 22, 2012, at 3:00 AM, Lin Ma <li...@gmail.com> wrote:

> Thanks so much for the help, Mike. I learned a lot from this discussion.
> 
> So, the conclusion I take from the discussion is: since how/when the JT merges counters in the middle of a running job is undefined, internal behavior, it is more reliable to read counters only after the whole job completes? Agree?
> 
> regards,
> Lin


Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
Yup. 
The counters at the end of the job are the most accurate. 

On Oct 22, 2012, at 3:00 AM, Lin Ma <li...@gmail.com> wrote:

> Thanks for the help so much, Mike. I learned a lot from this discussion.
> 
> So, the conclusion I learned from the discussion should be, since how/when JT merge counter in the middle of the process of a job is undefined and internal behavior, it is more reliable to read counter after the whole job completes? Agree?
> 
> regards,
> Lin
> 
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <mi...@hotmail.com> wrote:
> 
> On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Thanks for the detailed reply, Mike. Yes, my most confusion is resolved by you. The last two questions (or comments) are used to confirm my understanding is correct,
>> 
>> - is it normal use case or best practices for a job to consume/read the counters from previous completed job in an automatic way? I ask this because I am not sure whether the most use case of counter is human read and manual analysis, other then using another job to automatic consume the counters?
> 
> Lin, 
> Every job has a set of counters to maintain job statistics. 
> This is specifically for human analysis and to help understand what happened with your job. 
> It allows you to see how much data is read in by the job, how many records processed to be measured against how long the job took to complete.  It also showed you how much data is written back out.  
> 
> In addition to this,  a set of use cases for counters in Hadoop center on quality control. Its normal to chain jobs together to form a job flow. 
> A typical use case for Hadoop is to pull data from various sources, combine them and do some process on them, resulting in a data set that gets sent to another system for visualization. 
> 
> In this use case, there are usually data cleansing and validation jobs. As they run, its possible to track a number of defective records. At the end of that specific job, from the ToolRunner, or whichever job class you used to launch your job, you can then get these aggregated counters for the job and determine if the process passed or failed.  Based on this, you can exit your program with either a success or failed flag.  Job Flow control tools like Oozie can capture this and then decide to continue or to stop and alert an operator of an error. 
> 
>> - I want to confirm my understanding is correct, when each task completes, JT will aggregate/update the global counter values from the specific counter values updated by the complete task, but never expose global counters values until job completes? If it is correct, I am wondering why JT doing aggregation each time when a task completes, other than doing a one time aggregation when the job completes? Is there any design choice reasons? thanks.
> 
> That's a good question. I haven't looked at the code, so I can't say definitively when the JT performs its aggregation. However, as the job runs and in process, we can look at the job tracker web page(s) and see the counter summary. This would imply that there has to be some aggregation occurring mid-flight. (It would be trivial to sum the list of counters periodically to update the job statistics.)  Note too that if the JT web pages can show a counter, its possible to then write a monitoring tool that can monitor the job while running and then kill the job mid flight if a certain threshold of a counter is met. 
> 
> That is to say you could in theory write a monitoring process and watch the counters. If lets say an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill <job-id>' command. 
> 
>> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com> wrote:
>> 
>> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>> 
>>> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>>> 
>>> - I just want to confirm with you that, supposing in the same job, when a specific task completed (and counter is aggregated in JT after the task completed from our discussion?), the other running task in the same job cannot get the updated counter value from the previous completed task? I am asking this because I am thinking whether I can use counter to share a global value between tasks.
>> 
>> Yes that is correct. 
>> While I haven't looked at YARN (M/R 2.0) , M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN
>> 
>>> - If so, what is the traditional use case of counter, only use counter values after the whole job completes?
>>> 
>> Yes the counters are used to provide data at the end of the job... 
>> 
>>> BTW: appreciate if you could share me a few use cases from your experience about how counters are used.
>>> 
>> Well you have your typical job data like the number of records processed, total number of bytes read,  bytes written... 
>> 
>> But suppose you wanted to do some quality control on your input. 
>> So you need to keep a track on the count of bad records.  If this job is part of a process, you may want to include business logic in your job to halt the job flow if X% of the records contain bad data. 
>> 
>> Or your process takes input records and in processing them, they sort the records based on some characteristic and you want to count those sorted records as you processed them. 
>> 
>> For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than pay the toll manually each time. 
>> 
>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters where they drive through the sensor and the sensor doesn't capture the RFID tag. (Note its possible that you have a false positive where the car has an RFID chip but doesn't trip the sensor.) Pushing the data in a map/reduce job would require the use of counters.
>> 
>> Does that help? 
>> 
>> -Mike
>> 
>>> regards,
>>> Lin
>>> 
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com> wrote:
>>> Yeah, sorry... 
>>> 
>>> I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating their own counter foo. 
>>> As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task. 
>>> 
>>> At the end, the final count is aggregated from the list of counters for foo. 
>>> 
>>> 
>>> I don't know how you can get a task to ask information from the Job Tracker on how things are going in other tasks.  That is what I meant that you couldn't get information about the other counters or even the status of the other tasks running in the same job. 
>>> 
>>> I didn't see anything in the APIs that allowed for that type of flow... Of course having said that... someone pops up with a way to do just that. ;-) 
>>> 
>>> 
>>> Does that clarify things? 
>>> 
>>> -Mike
>>> 
>>> 
>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>> 
>>>> Hi Mike,
>>>> 
>>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>>> 
>>>> From your this statement "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see counters from each other, since JT maintains a unique counter for each tasks;
>>>> 
>>>> From your this comment "I meant that if a Task created and updated a counter, a different Task has access to that counter. " -- it seems different tasks could share/access the same counter.
>>>> 
>>>> Appreciate if you could help to clarify a bit.
>>>> 
>>>> regards,
>>>> Lin
>>>> 
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <mi...@hotmail.com> wrote:
>>>> 
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>> 
>>>>> Hi Mike,
>>>>> 
>>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>> 
>>>>> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?
>>>> 
>>>> Either. 
>>>> 
>>>>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
>>>>> 
>>>> I meant that if a Task created and updated a counter, a different Task has access to that counter. 
>>>> 
>>>> To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
>>>> 
>>>>> regards,
>>>>> Lin
>>>>> 
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
>>>>> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
>>>>> The JT then will aggregate them. 
>>>>> 
>>>>> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
>>>>> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
>>>>> 
>>>>> In terms of global accessibility... Maybe.
>>>>> 
>>>>> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
>>>>> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
>>>>> 
>>>>> HTH
>>>>> Mike
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>> 
>>>>>> Hi guys,
>>>>>> 
>>>>>> I have some quick questions regarding to Hadoop counter,
>>>>>> 
>>>>>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>>>>>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>>>>>> regards,
>>>>>> Lin
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
Yup. 
The counters at the end of the job are the most accurate. 

On Oct 22, 2012, at 3:00 AM, Lin Ma <li...@gmail.com> wrote:

> Thanks for the help so much, Mike. I learned a lot from this discussion.
> 
> So, the conclusion I learned from the discussion should be, since how/when JT merge counter in the middle of the process of a job is undefined and internal behavior, it is more reliable to read counter after the whole job completes? Agree?
> 
> regards,
> Lin
> 
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <mi...@hotmail.com> wrote:
> 
> On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Thanks for the detailed reply, Mike. Yes, my most confusion is resolved by you. The last two questions (or comments) are used to confirm my understanding is correct,
>> 
>> - is it normal use case or best practices for a job to consume/read the counters from previous completed job in an automatic way? I ask this because I am not sure whether the most use case of counter is human read and manual analysis, other then using another job to automatic consume the counters?
> 
> Lin, 
> Every job has a set of counters to maintain job statistics. 
> This is specifically for human analysis and to help understand what happened with your job. 
> It allows you to see how much data is read in by the job, how many records processed to be measured against how long the job took to complete.  It also showed you how much data is written back out.  
> 
> In addition to this,  a set of use cases for counters in Hadoop center on quality control. Its normal to chain jobs together to form a job flow. 
> A typical use case for Hadoop is to pull data from various sources, combine them and do some process on them, resulting in a data set that gets sent to another system for visualization. 
> 
> In this use case, there are usually data cleansing and validation jobs. As they run, its possible to track a number of defective records. At the end of that specific job, from the ToolRunner, or whichever job class you used to launch your job, you can then get these aggregated counters for the job and determine if the process passed or failed.  Based on this, you can exit your program with either a success or failed flag.  Job Flow control tools like Oozie can capture this and then decide to continue or to stop and alert an operator of an error. 
> 
>> - I want to confirm my understanding is correct, when each task completes, JT will aggregate/update the global counter values from the specific counter values updated by the complete task, but never expose global counters values until job completes? If it is correct, I am wondering why JT doing aggregation each time when a task completes, other than doing a one time aggregation when the job completes? Is there any design choice reasons? thanks.
> 
> That's a good question. I haven't looked at the code, so I can't say definitively when the JT performs its aggregation. However, as the job runs and in process, we can look at the job tracker web page(s) and see the counter summary. This would imply that there has to be some aggregation occurring mid-flight. (It would be trivial to sum the list of counters periodically to update the job statistics.)  Note too that if the JT web pages can show a counter, its possible to then write a monitoring tool that can monitor the job while running and then kill the job mid flight if a certain threshold of a counter is met. 
> 
> That is to say you could in theory write a monitoring process and watch the counters. If lets say an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill <job-id>' command. 
> 
>> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com> wrote:
>> 
>> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>> 
>>> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>>> 
>>> - I just want to confirm with you that, within the same job, when a specific task has completed (and its counters are aggregated in the JT after the task completes, per our discussion?), the other running tasks in the same job cannot get the updated counter value from the previously completed task? I am asking this because I am wondering whether I can use a counter to share a global value between tasks.
>> 
>> Yes, that is correct.
>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN.
>> 
>>> - If so, what is the traditional use case for counters: only using counter values after the whole job completes?
>>> 
>> Yes, the counters are used to provide data at the end of the job...
>> 
>>> BTW: I'd appreciate it if you could share a few use cases from your experience of how counters are used.
>>> 
>> Well, you have your typical job data like the number of records processed, total number of bytes read, bytes written...
>> 
>> But suppose you wanted to do some quality control on your input. 
>> So you need to keep track of the count of bad records. If this job is part of a process, you may want to include business logic in your job to halt the job flow if X% of the records contain bad data.
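
For the counting half of that, a bad-record counter in a mapper might look like this (just a sketch; the tab-separated input and the field-count check are invented for the example):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class QcMapper extends Mapper<LongWritable, Text, Text, Text> {
      // An enum gives a named, compile-time-checked counter.
      public enum Quality { GOOD_RECORDS, BAD_RECORDS }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 3) {                        // illustrative validity check
          context.getCounter(Quality.BAD_RECORDS).increment(1);
          return;                                       // drop the bad record
        }
        context.getCounter(Quality.GOOD_RECORDS).increment(1);
        context.write(new Text(fields[0]), value);
      }
    }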
>> 
>> Or your process takes input records and, in processing them, sorts the records based on some characteristic, and you want to count those sorted records as you process them.
>> 
>> For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than paying the toll manually each time.
>> 
>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters that drive through without the sensor capturing an RFID tag. (Note it's possible to have a false positive, where the car has an RFID chip but doesn't trip the sensor.) Pushing the data through a map/reduce job to answer this would require the use of counters.
>> 
>> Does that help? 
>> 
>> -Mike
>> 
>>> regards,
>>> Lin
>>> 
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com> wrote:
>>> Yeah, sorry... 
>>> 
>>> I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating its own counter foo.
>>> As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task. 
>>> 
>>> At the end, the final count is aggregated from the list of counters for foo. 
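
With the org.apache.hadoop.mapreduce API, that per-task dynamic creation is a one-liner (the group and counter names here are placeholders):

    // Called from map() or reduce(): "foo" springs into existence on first
    // use in this task; the JT keeps per-task values and rolls them up
    // into a single job-level total at the end.
    context.getCounter("DynamicGroup", "foo").increment(1);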
>>> 
>>> 
>>> I don't know how you can get a task to ask the Job Tracker for information on how things are going in other tasks. That is what I meant: you couldn't get information about the other counters, or even the status of the other tasks running in the same job.
>>> 
>>> I didn't see anything in the APIs that allowed for that type of flow... Of course having said that... someone pops up with a way to do just that. ;-) 
>>> 
>>> 
>>> Does that clarify things? 
>>> 
>>> -Mike
>>> 
>>> 
>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>> 
>>>> Hi Mike,
>>>> 
>>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>>> 
>>>> From this statement of yours, "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see the counters of the others, since the JT maintains a unique counter for each task;
>>>> 
>>>> From this comment of yours, "I meant that if a Task created and updated a counter, a different Task has access to that counter." -- it seems different tasks could share/access the same counter.
>>>> 
>>>> Appreciate if you could help to clarify a bit.
>>>> 
>>>> regards,
>>>> Lin
>>>> 
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <mi...@hotmail.com> wrote:
>>>> 
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>> 
>>>>> Hi Mike,
>>>>> 
>>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>> 
>>>>> 1. By "task", do you mean a specific mapper instance or a specific reducer instance?
>>>> 
>>>> Either. 
>>>> 
>>>>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
>>>>> 
>>>> I meant that if a Task created and updated a counter, a different Task has access to that counter. 
>>>> 
>>>> To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
>>>> 
>>>>> regards,
>>>>> Lin
>>>>> 
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
>>>>> As I understand it... each Task has its own counters, which are independently updated. As the tasks report back to the JT, they update the counters' status.
>>>>> The JT then will aggregate them. 
>>>>> 
>>>>> In terms of performance, counters take up some memory in the JT, so while it's OK to use them, if you abuse them you can run into issues.
>>>>> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (number of TTs) and the number of counters.
>>>>> 
>>>>> In terms of global accessibility... Maybe.
>>>>> 
>>>>> The reason I say maybe is that I'm not sure what you mean by globally accessible.
>>>>> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous. Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained.) Note, I haven't looked at the source code so I am probably wrong.
>>>>> 
>>>>> HTH
>>>>> Mike
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>> 
>>>>>> Hi guys,
>>>>>> 
>>>>>> I have some quick questions regarding Hadoop counters,
>>>>>> 
>>>>>> Is a Hadoop counter (custom defined) globally accessible (for both read and write) by all Mappers and Reducers in a job?
>>>>>> What are the performance implications and best practices of using Hadoop counters? I am not sure whether using Hadoop counters too heavily will degrade the performance of the whole job.
>>>>>> regards,
>>>>>> Lin
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks so much for the help, Mike. I learned a lot from this discussion.

So, the conclusion I should draw from the discussion is: since how/when the
JT merges counters in the middle of a job is undefined internal behavior,
it is more reliable to read counters only after the whole job completes?
Agree?

regards,
Lin

On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <mi...@hotmail.com> wrote:

>
> On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:
>
> Thanks for the detailed reply, Mike. Yes, most of my confusion has been
> resolved by you. The last two questions (or comments) are just to confirm
> my understanding is correct:
>
> - Is it a normal use case or best practice for a job to consume/read the
> counters from a previously completed job in an automatic way? I ask this
> because I am not sure whether the main use case for counters is human
> reading and manual analysis, rather than having another job consume the
> counters automatically.
>
>
> Lin,
> Every job has a set of counters to maintain job statistics.
> This is specifically for human analysis and to help understand what
> happened with your job.
> It allows you to see how much data is read in by the job and how many
> records are processed, measured against how long the job took to complete.
> It also shows you how much data is written back out.
>
> In addition to this, a set of use cases for counters in Hadoop centers on
> quality control. It's normal to chain jobs together to form a job flow.
> A typical use case for Hadoop is to pull data from various sources,
> combine it and do some processing on it, resulting in a data set that gets
> sent to another system for visualization.
>
> In this use case, there are usually data cleansing and validation jobs. As
> they run, it's possible to track the number of defective records. At the
> end of that specific job, from the ToolRunner, or whichever job class you
> used to launch your job, you can then get these aggregated counters for
> the job and determine if the process passed or failed. Based on this, you
> can exit your program with either a success or a failure flag. Job flow
> control tools like Oozie can capture this and then decide to continue, or
> to stop and alert an operator of an error.
>
> - I want to confirm my understanding is correct: when each task completes,
> the JT will aggregate/update the global counter values from the specific
> counter values reported by the completed task, but never expose the global
> counter values until the job completes? If that is correct, I am wondering
> why the JT does the aggregation each time a task completes, rather than
> doing a one-time aggregation when the job completes? Is there any design
> reason? Thanks.
>
>
> That's a good question. I haven't looked at the code, so I can't say
> definitively when the JT performs its aggregation. However, while the job
> runs and is in process, we can look at the job tracker web page(s) and see
> the counter summary. This implies that there has to be some aggregation
> occurring mid-flight. (It would be trivial to sum the list of counters
> periodically to update the job statistics.) Note too that if the JT web
> pages can show a counter, it's possible to write a monitoring tool that
> can watch the job while it is running and then kill it mid-flight if a
> certain counter threshold is met.
>
> That is to say, you could in theory write a monitoring process and watch
> the counters. If, let's say, an error counter hits a predetermined
> threshold, you could then issue a 'hadoop job -kill <job-id>' command.
>
>
> regards,
> Lin
>
On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>>
>> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>>
>> Thanks for the detailed reply, Mike. I learned a lot from the discussion.
>>
>> - I just want to confirm with you that, within the same job, when a
>> specific task has completed (and its counters are aggregated in the JT
>> after the task completes, per our discussion?), the other running tasks
>> in the same job cannot get the updated counter value from the previously
>> completed task? I am asking this because I am wondering whether I can use
>> a counter to share a global value between tasks.
>>
>>
>> Yes, that is correct.
>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy
>> way for a task to query the job tracker. This might have changed in YARN.
>>
>> - If so, what is the traditional use case for counters: only using
>> counter values after the whole job completes?
>>
>> Yes, the counters are used to provide data at the end of the job...
>>
>> BTW: I'd appreciate it if you could share a few use cases from your
>> experience of how counters are used.
>>
>> Well, you have your typical job data like the number of records
>> processed, total number of bytes read, bytes written...
>>
>> But suppose you wanted to do some quality control on your input.
>> So you need to keep track of the count of bad records. If this job is
>> part of a process, you may want to include business logic in your job to
>> halt the job flow if X% of the records contain bad data.
>>
>> Or your process takes input records and, in processing them, sorts the
>> records based on some characteristic, and you want to count those sorted
>> records as you process them.
>>
>> For a more concrete example, the Illinois Tollway has these 'fast pass'
>> lanes where cars equipped with RFID tags can have the tolls automatically
>> deducted from their accounts rather than paying the toll manually each
>> time.
>>
>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
>> cheaters that drive through without the sensor capturing an RFID tag.
>> (Note it's possible to have a false positive, where the car has an RFID
>> chip but doesn't trip the sensor.) Pushing the data through a map/reduce
>> job to answer this would require the use of counters.
>>
>> Does that help?
>>
>> -Mike
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com
>> > wrote:
>>
>>> Yeah, sorry...
>>>
>>> I meant that if you were dynamically creating a counter foo in the
>>> Mapper task, then each mapper would be creating its own counter foo.
>>> As the job runs, these counters will eventually be sent up to the JT.
>>> The job tracker would keep a separate counter for each task.
>>>
>>> At the end, the final count is aggregated from the list of counters for
>>> foo.
>>>
>>>
>>> I don't know how you can get a task to ask the Job Tracker for
>>> information on how things are going in other tasks. That is what I
>>> meant: you couldn't get information about the other counters, or even
>>> the status of the other tasks running in the same job.
>>>
>>> I didn't see anything in the APIs that allowed for that type of flow...
>>> Of course having said that... someone pops up with a way to do just that.
>>> ;-)
>>>
>>>
>>> Does that clarify things?
>>>
>>> -Mike
>>>
>>>
>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>>
>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>>
>>> From this statement of yours, "It would make sense that the JT
>>> maintains a unique counter for each task until the tasks complete." --
>>> it seems each task cannot see the counters of the others, since the JT
>>> maintains a unique counter for each task;
>>>
>>> From this comment of yours, "I meant that if a Task created and updated
>>> a counter, a different Task has access to that counter." -- it seems
>>> different tasks could share/access the same counter.
>>>
>>> Appreciate if you could help to clarify a bit.
>>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>>
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>
>>>> 1. By "task", do you mean a specific mapper instance or a specific
>>>> reducer instance?
>>>>
>>>>
>>>> Either.
>>>>
>>>> 2. "However, I do not believe that a separate Task could connect with
>>>> the JT and see if the counter exists or if it could get a value or even an
>>>> accurate value since the updates are asynchronous." -- do you mean if a
>>>> mapper is updating custom counter ABC, and another mapper is updating the
>>>> same custom counter ABC, their counter values are updated independently
>>>> by the different mappers, and will not be published (aggregated)
>>>> externally until the job has completed successfully?
>>>>
>>>> I meant that if a Task created and updated a counter, a different Task
>>>> has access to that counter.
>>>>
>>>> To give you an example, if I want to count the number of quality errors
>>>> and then fail after X number of errors, I can't use Global counters to do
>>>> this.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> As I understand it... each Task has its own counters, which are
>>>>> independently updated. As the tasks report back to the JT, they update
>>>>> the counters' status.
>>>>> The JT then will aggregate them.
>>>>>
>>>>> In terms of performance, counters take up some memory in the JT, so
>>>>> while it's OK to use them, if you abuse them you can run into issues.
>>>>> As to limits... I guess that will depend on the amount of memory on
>>>>> the JT machine, the size of the cluster (number of TTs) and the number
>>>>> of counters.
>>>>>
>>>>> In terms of global accessibility... Maybe.
>>>>>
>>>>> The reason I say maybe is that I'm not sure what you mean by
>>>>> globally accessible.
>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>>> separate Task could connect with the JT and see if the counter exists or if
>>>>> it could get a value or even an accurate value since the updates are
>>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>>> aggregated until the job ends. It would make sense that the JT maintains a
>>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>>> would have to delete the counters so that when the task is restarted the
>>>>> correct count is maintained.) Note, I haven't looked at the source code
>>>>> so I am probably wrong.
>>>>>
>>>>> HTH
>>>>> Mike
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have some quick questions regarding Hadoop counters,
>>>>>
>>>>>
>>>>>    - Is a Hadoop counter (custom defined) globally accessible (for
>>>>>    both read and write) by all Mappers and Reducers in a job?
>>>>>    - What are the performance implications and best practices of using
>>>>>    Hadoop counters? I am not sure whether using Hadoop counters too
>>>>>    heavily will degrade the performance of the whole job.
>>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Java heap space error

Posted by Subash D'Souza <sd...@truecar.com>.
I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. The cluster was running fine until last week, when there was an error in the name node log file and I had to reformat it and put the data back.

Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped my mapred.child.java.opts first from 200m to 400m to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark.
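
For reference, a sketch of how that setting is usually expressed in mapred-site.xml (the -Xmx form is an assumption; only the sizes are given above):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx800m</value>
    </property>

The same value can also be set per Hive session with: set mapred.child.java.opts=-Xmx800m;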

I checked the log files, and the only thing they output is the Java heap space error. Nothing more.

Any help would be appreciated.

Thanks
Subash


>>> counter for each tasks;
>>>
>>> From your this comment "I meant that if a Task created and updated a
>>> counter, a different Task has access to that counter. " -- it seems
>>> different tasks could share/access the same counter.
>>>
>>> Appreciate if you could help to clarify a bit.
>>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>>
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>>
>>>> 1. For "task", you mean a specific mapper instance, or a specific
>>>> reducer instance?
>>>>
>>>>
>>>> Either.
>>>>
>>>> 2. "However, I do not believe that a separate Task could connect with
>>>> the JT and see if the counter exists or if it could get a value or even an
>>>> accurate value since the updates are asynchronous." -- do you mean if a
>>>> mapper is updating custom counter ABC, and another mapper is updating the
>>>> same customer counter ABC, their counter values are updated independently
>>>> by different mappers, and will not published (aggregated) externally until
>>>> job completed successfully?
>>>>
>>>> I meant that if a Task created and updated a counter, a different Task
>>>> has access to that counter.
>>>>
>>>> To give you an example, if I want to count the number of quality errors
>>>> and then fail after X number of errors, I can't use Global counters to do
>>>> this.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> As I understand it... each Task has its own counters and are
>>>>> independently updated. As they report back to the JT, they update the
>>>>> counter(s)' status.
>>>>> The JT then will aggregate them.
>>>>>
>>>>> In terms of performance, Counters take up some memory in the JT so
>>>>> while its OK to use them, if you abuse them, you can run in to issues.
>>>>> As to limits... I guess that will depend on the amount of memory on
>>>>> the JT machine, the size of the cluster (Number of TT) and the number of
>>>>> counters.
>>>>>
>>>>> In terms of global accessibility... Maybe.
>>>>>
>>>>> The reason I say maybe is that I'm not sure by what you mean by
>>>>> globally accessible.
>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>>> separate Task could connect with the JT and see if the counter exists or if
>>>>> it could get a value or even an accurate value since the updates are
>>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>>> aggregated until the job ends. It would make sense that the JT maintains a
>>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>>> would have to delete the counters so that when the task is restarted the
>>>>> correct count is maintained. )  Note, I haven't looked at the source code
>>>>> so I am probably wrong.
>>>>>
>>>>> HTH
>>>>> Mike
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have some quick questions regarding to Hadoop counter,
>>>>>
>>>>>
>>>>>    - Hadoop counter (customer defined) is global accessible (for both
>>>>>    read and write) for all Mappers and Reducers in a job?
>>>>>    - What is the performance and best practices of using Hadoop
>>>>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>>>>    performance downgrade to the whole job?
>>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
On Oct 21, 2012, at 1:45 AM, Lin Ma <li...@gmail.com> wrote:

> Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved. The last two questions (or comments) are to confirm that my understanding is correct,
> 
> - is it a normal use case, or best practice, for a job to consume/read the counters from a previously completed job in an automatic way? I ask this because I am not sure whether the main use case for counters is human reading and manual analysis, rather than having another job automatically consume the counters?

Lin, 
Every job has a set of counters to maintain job statistics. 
This is specifically for human analysis and to help you understand what happened with your job. 
It allows you to see how much data was read in by the job and how many records were processed, measured against how long the job took to complete.  It also shows you how much data was written back out.
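
For what it's worth, a driver can read those standard counters back once the job has finished. A minimal sketch with the new (mapreduce) API -- untested, and note the built-in counter enum is TaskCounter on the 2.x line but Task.Counter on 1.x:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// 'job' is assumed to be already configured with mapper/reducer/paths.
boolean ok = job.waitForCompletion(true);
Counters counters = job.getCounters();
long mapIn  = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
long redOut = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
System.out.println("map input records:     " + mapIn);
System.out.println("reduce output records: " + redOut);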

In addition to this, a set of use cases for counters in Hadoop centers on quality control. It's normal to chain jobs together to form a job flow. 
A typical use case for Hadoop is to pull data from various sources, combine them and do some processing on them, resulting in a data set that gets sent to another system for visualization. 

In this use case, there are usually data cleansing and validation jobs. As they run, it's possible to track the number of defective records. At the end of that specific job, from the ToolRunner, or whichever job class you used to launch your job, you can then get the aggregated counters for the job and determine whether the process passed or failed.  Based on this, you can exit your program with either a success or a failure flag.  Job flow control tools like Oozie can capture this and then decide to continue, or to stop and alert an operator of an error.
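
To make that concrete, here is a minimal sketch of the pattern (untested; the QC group/counter names, the isValid() helper and the 5% threshold are all made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleanseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    if (!isValid(value)) {                        // hypothetical validation helper
      ctx.getCounter("QC", "BAD_RECORDS").increment(1);
      return;                                     // drop the defective record
    }
    ctx.getCounter("QC", "GOOD_RECORDS").increment(1);
    ctx.write(key, value);
  }

  private boolean isValid(Text record) {
    return record.getLength() > 0;                // stand-in for real validation logic
  }
}

// In the driver (e.g. inside Tool.run()), after the job finishes:
boolean ok = job.waitForCompletion(true);
long bad  = job.getCounters().findCounter("QC", "BAD_RECORDS").getValue();
long good = job.getCounters().findCounter("QC", "GOOD_RECORDS").getValue();
// Exit non-zero if the job failed or more than 5% of the records were bad;
// Oozie (or a shell wrapper) can branch on the exit code.
return (!ok || bad * 100 > (bad + good) * 5) ? 1 : 0;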

> - I want to confirm my understanding: when each task completes, the JT will aggregate/update the global counter values from the counter values reported by the completed task, but never exposes the global counter values until the job completes? If that is correct, I am wondering why the JT does the aggregation each time a task completes, rather than doing a one-time aggregation when the job completes? Are there any design reasons? thanks.

That's a good question. I haven't looked at the code, so I can't say definitively when the JT performs its aggregation. However, while the job runs, we can look at the job tracker web page(s) and see the counter summary, which implies that some aggregation is occurring mid-flight. (It would be trivial to sum the list of counters periodically to update the job statistics.)  Note too that if the JT web pages can show a counter, it's possible to write a monitoring tool that watches the job while it is running and kills it mid-flight if a certain counter crosses a threshold. 

That is to say, you could in theory write a monitoring process and watch the counters. If, say, an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill <job-id>' command.
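
Something along these lines -- an untested sketch against the 1.x mapred API, with a placeholder threshold; the same value is also visible from the shell via 'hadoop job -counter <job-id> <group> <name>':

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class CounterWatchdog {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());        // picks up the cluster config
    RunningJob job = client.getJob(JobID.forName(args[0])); // job id from the command line
    while (job != null && !job.isComplete()) {
      long bad = job.getCounters().getGroup("QC").getCounter("BAD_RECORDS");
      if (bad > 10000) {                                    // made-up threshold
        job.killJob();                                      // same effect as 'hadoop job -kill'
        break;
      }
      Thread.sleep(30 * 1000);                              // poll every 30 seconds
    }
  }
}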

> 
> regards,
> Lin
> 
> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com> wrote:
> 
> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
> 
>> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>> 
>> - I just want to confirm with you that, supposing in the same job, when a specific task completed (and counter is aggregated in JT after the task completed from our discussion?), the other running task in the same job cannot get the updated counter value from the previous completed task? I am asking this because I am thinking whether I can use counter to share a global value between tasks.
> 
> Yes that is correct. 
> While I haven't looked at YARN (M/R 2.0) , M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN
> 
>> - If so, what is the traditional use case of counter, only use counter values after the whole job completes?
>> 
> Yes the counters are used to provide data at the end of the job... 
> 
>> BTW: appreciate if you could share me a few use cases from your experience about how counters are used.
>> 
> Well you have your typical job data like the number of records processed, total number of bytes read,  bytes written... 
> 
> But suppose you wanted to do some quality control on your input. 
> So you need to keep a track on the count of bad records.  If this job is part of a process, you may want to include business logic in your job to halt the job flow if X% of the records contain bad data. 
> 
> Or your process takes input records and in processing them, they sort the records based on some characteristic and you want to count those sorted records as you processed them. 
> 
> For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than pay the toll manually each time. 
> 
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters where they drive through the sensor and the sensor doesn't capture the RFID tag. (Note its possible that you have a false positive where the car has an RFID chip but doesn't trip the sensor.) Pushing the data in a map/reduce job would require the use of counters.
> 
> Does that help? 
> 
> -Mike
> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com> wrote:
>> Yeah, sorry... 
>> 
>> I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating their own counter foo. 
>> As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task. 
>> 
>> At the end, the final count is aggregated from the list of counters for foo. 
>> 
>> 
>> I don't know how you can get a task to ask information from the Job Tracker on how things are going in other tasks.  That is what I meant that you couldn't get information about the other counters or even the status of the other tasks running in the same job. 
>> 
>> I didn't see anything in the APIs that allowed for that type of flow... Of course having said that... someone pops up with a way to do just that. ;-) 
>> 
>> 
>> Does that clarify things? 
>> 
>> -Mike
>> 
>> 
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>> 
>>> Hi Mike,
>>> 
>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>> 
>>> From your this statement "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see counters from each other, since JT maintains a unique counter for each tasks;
>>> 
>>> From your this comment "I meant that if a Task created and updated a counter, a different Task has access to that counter. " -- it seems different tasks could share/access the same counter.
>>> 
>>> Appreciate if you could help to clarify a bit.
>>> 
>>> regards,
>>> Lin
>>> 
>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <mi...@hotmail.com> wrote:
>>> 
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>> 
>>>> Hi Mike,
>>>> 
>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>> 
>>>> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?
>>> 
>>> Either. 
>>> 
>>>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
>>>> 
>>> I meant that if a Task created and updated a counter, a different Task has access to that counter. 
>>> 
>>> To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
>>> 
>>>> regards,
>>>> Lin
>>>> 
>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
>>>> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
>>>> The JT then will aggregate them. 
>>>> 
>>>> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
>>>> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
>>>> 
>>>> In terms of global accessibility... Maybe.
>>>> 
>>>> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
>>>> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
>>>> 
>>>> HTH
>>>> Mike
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> I have some quick questions regarding to Hadoop counter,
>>>>> 
>>>>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>>>>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>>>>> regards,
>>>>> Lin
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved.
The last two questions (or comments) are to confirm that my understanding is
correct,

- is it a normal use case, or best practice, for a job to consume/read the
counters from a previously completed job in an automatic way? I ask this
because I am not sure whether the main use case for counters is human
reading and manual analysis, rather than having another job automatically
consume the counters?
- I want to confirm my understanding: when each task completes, the JT will
aggregate/update the global counter values from the counter values reported
by the completed task, but never exposes the global counter values until the
job completes? If that is correct, I am wondering why the JT does the
aggregation each time a task completes, rather than doing a one-time
aggregation when the job completes? Are there any design reasons? thanks.

regards,
Lin

On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com>wrote:

>
> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>
> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>
> - I just want to confirm with you that, supposing in the same job, when a
> specific task completed (and counter is aggregated in JT after the task
> completed from our discussion?), the other running task in the same job
> cannot get the updated counter value from the previous completed task? I am
> asking this because I am thinking whether I can use counter to share a
> global value between tasks.
>
>
> Yes that is correct.
> While I haven't looked at YARN (M/R 2.0) , M/R 1.x doesn't have an easy
> way for a task to query the job tracker. This might have changed in YARN
>
> - If so, what is the traditional use case of counter, only use counter
> values after the whole job completes?
>
> Yes the counters are used to provide data at the end of the job...
>
> BTW: appreciate if you could share me a few use cases from your experience
> about how counters are used.
>
> Well you have your typical job data like the number of records processed,
> total number of bytes read,  bytes written...
>
> But suppose you wanted to do some quality control on your input.
> So you need to keep a track on the count of bad records.  If this job is
> part of a process, you may want to include business logic in your job to
> halt the job flow if X% of the records contain bad data.
>
> Or your process takes input records and in processing them, they sort the
> records based on some characteristic and you want to count those sorted
> records as you processed them.
>
> For a more concrete example, the Illinois Tollway has these 'fast pass'
> lanes where cars equipped with RFID tags can have the tolls automatically
> deducted from their accounts rather than pay the toll manually each time.
>
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
> cheaters where they drive through the sensor and the sensor doesn't capture
> the RFID tag. (Note its possible that you have a false positive where the
> car has an RFID chip but doesn't trip the sensor.) Pushing the data in a
> map/reduce job would require the use of counters.
>
> Does that help?
>
> -Mike
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com>wrote:
>
>> Yeah, sorry...
>>
>> I meant that if you were dynamically creating a counter foo in the Mapper
>> task, then each mapper would be creating their own counter foo.
>> As the job runs, these counters will eventually be sent up to the JT. The
>> job tracker would keep a separate counter for each task.
>>
>> At the end, the final count is aggregated from the list of counters for
>> foo.
>>
>>
>> I don't know how you can get a task to ask information from the Job
>> Tracker on how things are going in other tasks.  That is what I meant that
>> you couldn't get information about the other counters or even the status of
>> the other tasks running in the same job.
>>
>> I didn't see anything in the APIs that allowed for that type of flow...
>> Of course having said that... someone pops up with a way to do just that.
>> ;-)
>>
>>
>> Does that clarify things?
>>
>> -Mike
>>
>>
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>
>> From your this statement "It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete." -- it seems each
>> task cannot see counters from each other, since JT maintains a unique
>> counter for each tasks;
>>
>> From your this comment "I meant that if a Task created and updated a
>> counter, a different Task has access to that counter. " -- it seems
>> different tasks could share/access the same counter.
>>
>> Appreciate if you could help to clarify a bit.
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>> michael_segel@hotmail.com> wrote:
>>
>>>
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the detailed reply. Two quick questions/comments,
>>>
>>> 1. For "task", you mean a specific mapper instance, or a specific
>>> reducer instance?
>>>
>>>
>>> Either.
>>>
>>> 2. "However, I do not believe that a separate Task could connect with
>>> the JT and see if the counter exists or if it could get a value or even an
>>> accurate value since the updates are asynchronous." -- do you mean if a
>>> mapper is updating custom counter ABC, and another mapper is updating the
>>> same customer counter ABC, their counter values are updated independently
>>> by different mappers, and will not published (aggregated) externally until
>>> job completed successfully?
>>>
>>> I meant that if a Task created and updated a counter, a different Task
>>> has access to that counter.
>>>
>>> To give you an example, if I want to count the number of quality errors
>>> and then fail after X number of errors, I can't use Global counters to do
>>> this.
>>>
>>> regards,
>>> Lin
>>>
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> As I understand it... each Task has its own counters and are
>>>> independently updated. As they report back to the JT, they update the
>>>> counter(s)' status.
>>>> The JT then will aggregate them.
>>>>
>>>> In terms of performance, Counters take up some memory in the JT so
>>>> while its OK to use them, if you abuse them, you can run in to issues.
>>>> As to limits... I guess that will depend on the amount of memory on the
>>>> JT machine, the size of the cluster (Number of TT) and the number of
>>>> counters.
>>>>
>>>> In terms of global accessibility... Maybe.
>>>>
>>>> The reason I say maybe is that I'm not sure by what you mean by
>>>> globally accessible.
>>>> If a task creates and implements a dynamic counter... I know that it
>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>> separate Task could connect with the JT and see if the counter exists or if
>>>> it could get a value or even an accurate value since the updates are
>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>> aggregated until the job ends. It would make sense that the JT maintains a
>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>> would have to delete the counters so that when the task is restarted the
>>>> correct count is maintained. )  Note, I haven't looked at the source code
>>>> so I am probably wrong.
>>>>
>>>> HTH
>>>> Mike
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I have some quick questions regarding to Hadoop counter,
>>>>
>>>>
>>>>    - Hadoop counter (customer defined) is global accessible (for both
>>>>    read and write) for all Mappers and Reducers in a job?
>>>>    - What is the performance and best practices of using Hadoop
>>>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>>>    performance downgrade to the whole job?
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks for the detailed reply, Mike. Yes, my most confusion is resolved by
you. The last two questions (or comments) are used to confirm my
understanding is correct,

- Is it a normal use case, or best practice, for a job to automatically
consume/read the counters from a previously completed job? I ask this
because I am not sure whether the main use case for counters is human
reading and manual analysis, rather than having another job consume the
counters automatically. (A driver-side sketch of what I mean follows after
these questions.)
- I want to confirm my understanding: when each task completes, the JT
aggregates/updates the global counter values from the counter values
reported by the completed task, but never exposes the global counter
values until the job completes? If that is correct, I am wondering why the
JT aggregates each time a task completes, rather than doing a one-time
aggregation when the job completes. Is there a reason for that design
choice? Thanks.
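
For the automated case, here is the kind of driver-side sketch I have in
mind; the counter enum, job names, and configuration key are all made up
for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainedDriver {

  // Hypothetical counter that the first job's tasks increment.
  public static enum Quality { BAD_RECORDS }

  public static void main(String[] args) throws Exception {
    Job first = new Job(new Configuration(), "first-pass");
    // ... set mapper/reducer classes and input/output paths here ...
    if (!first.waitForCompletion(true)) {
      System.exit(1);                        // first job failed outright
    }

    // Only here, after completion, is the counter a reliable global total.
    long badRecords = first.getCounters()
        .findCounter(Quality.BAD_RECORDS).getValue();

    // Hand the aggregated value to a follow-on job via its Configuration.
    Configuration nextConf = new Configuration();
    nextConf.setLong("myapp.first.bad.records", badRecords);
    Job second = new Job(nextConf, "second-pass");
    // ... configure and submit the second job here ...
  }
}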

regards,
Lin

On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <mi...@hotmail.com>wrote:

>
> On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:
>
> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>
> - I just want to confirm with you that, supposing in the same job, when a
> specific task completed (and counter is aggregated in JT after the task
> completed from our discussion?), the other running task in the same job
> cannot get the updated counter value from the previous completed task? I am
> asking this because I am thinking whether I can use counter to share a
> global value between tasks.
>
>
> Yes that is correct.
> While I haven't looked at YARN (M/R 2.0) , M/R 1.x doesn't have an easy
> way for a task to query the job tracker. This might have changed in YARN
>
> - If so, what is the traditional use case of counter, only use counter
> values after the whole job completes?
>
> Yes the counters are used to provide data at the end of the job...
>
> BTW: appreciate if you could share me a few use cases from your experience
> about how counters are used.
>
> Well you have your typical job data like the number of records processed,
> total number of bytes read,  bytes written...
>
> But suppose you wanted to do some quality control on your input.
> So you need to keep a track on the count of bad records.  If this job is
> part of a process, you may want to include business logic in your job to
> halt the job flow if X% of the records contain bad data.
>
> Or your process takes input records and in processing them, they sort the
> records based on some characteristic and you want to count those sorted
> records as you processed them.
>
> For a more concrete example, the Illinois Tollway has these 'fast pass'
> lanes where cars equipped with RFID tags can have the tolls automatically
> deducted from their accounts rather than pay the toll manually each time.
>
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
> cheaters where they drive through the sensor and the sensor doesn't capture
> the RFID tag. (Note its possible that you have a false positive where the
> car has an RFID chip but doesn't trip the sensor.) Pushing the data in a
> map/reduce job would require the use of counters.
>
> Does that help?
>
> -Mike
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com>wrote:
>
>> Yeah, sorry...
>>
>> I meant that if you were dynamically creating a counter foo in the Mapper
>> task, then each mapper would be creating their own counter foo.
>> As the job runs, these counters will eventually be sent up to the JT. The
>> job tracker would keep a separate counter for each task.
>>
>> At the end, the final count is aggregated from the list of counters for
>> foo.
>>
>>
>> I don't know how you can get a task to ask information from the Job
>> Tracker on how things are going in other tasks.  That is what I meant that
>> you couldn't get information about the other counters or even the status of
>> the other tasks running in the same job.
>>
>> I didn't see anything in the APIs that allowed for that type of flow...
>> Of course having said that... someone pops up with a way to do just that.
>> ;-)
>>
>>
>> Does that clarify things?
>>
>> -Mike
>>
>>
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>
>> From your this statement "It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete." -- it seems each
>> task cannot see counters from each other, since JT maintains a unique
>> counter for each tasks;
>>
>> From your this comment "I meant that if a Task created and updated a
>> counter, a different Task has access to that counter. " -- it seems
>> different tasks could share/access the same counter.
>>
>> Appreciate if you could help to clarify a bit.
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>> michael_segel@hotmail.com> wrote:
>>
>>>
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the detailed reply. Two quick questions/comments,
>>>
>>> 1. For "task", you mean a specific mapper instance, or a specific
>>> reducer instance?
>>>
>>>
>>> Either.
>>>
>>> 2. "However, I do not believe that a separate Task could connect with
>>> the JT and see if the counter exists or if it could get a value or even an
>>> accurate value since the updates are asynchronous." -- do you mean if a
>>> mapper is updating custom counter ABC, and another mapper is updating the
>>> same customer counter ABC, their counter values are updated independently
>>> by different mappers, and will not published (aggregated) externally until
>>> job completed successfully?
>>>
>>> I meant that if a Task created and updated a counter, a different Task
>>> has access to that counter.
>>>
>>> To give you an example, if I want to count the number of quality errors
>>> and then fail after X number of errors, I can't use Global counters to do
>>> this.
>>>
>>> regards,
>>> Lin
>>>
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> As I understand it... each Task has its own counters and are
>>>> independently updated. As they report back to the JT, they update the
>>>> counter(s)' status.
>>>> The JT then will aggregate them.
>>>>
>>>> In terms of performance, Counters take up some memory in the JT so
>>>> while its OK to use them, if you abuse them, you can run in to issues.
>>>> As to limits... I guess that will depend on the amount of memory on the
>>>> JT machine, the size of the cluster (Number of TT) and the number of
>>>> counters.
>>>>
>>>> In terms of global accessibility... Maybe.
>>>>
>>>> The reason I say maybe is that I'm not sure by what you mean by
>>>> globally accessible.
>>>> If a task creates and implements a dynamic counter... I know that it
>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>> separate Task could connect with the JT and see if the counter exists or if
>>>> it could get a value or even an accurate value since the updates are
>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>> aggregated until the job ends. It would make sense that the JT maintains a
>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>> would have to delete the counters so that when the task is restarted the
>>>> correct count is maintained. )  Note, I haven't looked at the source code
>>>> so I am probably wrong.
>>>>
>>>> HTH
>>>> Mike
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I have some quick questions regarding to Hadoop counter,
>>>>
>>>>
>>>>    - Hadoop counter (customer defined) is global accessible (for both
>>>>    read and write) for all Mappers and Reducers in a job?
>>>>    - What is the performance and best practices of using Hadoop
>>>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>>>    performance downgrade to the whole job?
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
On Oct 19, 2012, at 10:27 PM, Lin Ma <li...@gmail.com> wrote:

> Thanks for the detailed reply Mike, I learned a lot from the discussion.
> 
> - I just want to confirm with you that, supposing in the same job, when a specific task completed (and counter is aggregated in JT after the task completed from our discussion?), the other running task in the same job cannot get the updated counter value from the previous completed task? I am asking this because I am thinking whether I can use counter to share a global value between tasks.

Yes, that is correct. 
While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN.

> - If so, what is the traditional use case of counter, only use counter values after the whole job completes?
> 
Yes, the counters are used to provide data at the end of the job... 
> BTW: appreciate if you could share me a few use cases from your experience about how counters are used.
> 
Well, you have your typical job data, like the number of records processed, total number of bytes read, bytes written... 

But suppose you wanted to do some quality control on your input. 
So you need to keep track of the count of bad records. If this job is part of a larger process, you may want to include business logic in your job to halt the job flow if X% of the records contain bad data. Roughly, the pattern looks like the sketch below.
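
(A rough illustration only; the class, counter, and threshold names are
invented for the example, not a standard API.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class QcMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  public static enum Quality { GOOD_RECORDS, BAD_RECORDS }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.getLength() == 0) {            // placeholder validity check
      // Each task bumps its own copy; the JT aggregates the per-task values.
      context.getCounter(Quality.BAD_RECORDS).increment(1);
      return;
    }
    context.getCounter(Quality.GOOD_RECORDS).increment(1);
    context.write(value, NullWritable.get());
  }

  // Driver-side check, valid only after job.waitForCompletion(true):
  public static void failIfTooManyBad(Job job) throws IOException {
    long bad  = job.getCounters().findCounter(Quality.BAD_RECORDS).getValue();
    long good = job.getCounters().findCounter(Quality.GOOD_RECORDS).getValue();
    if (bad * 100 > (bad + good) * 5) {      // more than 5% bad: halt the flow
      throw new IOException("QC failed: " + bad + " bad records");
    }
  }
}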

Or your process takes input records, sorts them based on some characteristic as it processes them, and you want to count those sorted records as you go. 

For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than paying the toll manually each time. 

Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters that drive through the sensor without the sensor capturing the RFID tag. (Note it's possible to have a false positive, where the car has an RFID chip but doesn't trip the sensor.) Pushing the data through a map/reduce job would require the use of counters.

Does that help? 

-Mike

> regards,
> Lin
> 
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com> wrote:
> Yeah, sorry... 
> 
> I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating their own counter foo. 
> As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task. 
> 
> At the end, the final count is aggregated from the list of counters for foo. 
> 
> 
> I don't know how you can get a task to ask information from the Job Tracker on how things are going in other tasks.  That is what I meant that you couldn't get information about the other counters or even the status of the other tasks running in the same job. 
> 
> I didn't see anything in the APIs that allowed for that type of flow... Of course having said that... someone pops up with a way to do just that. ;-) 
> 
> 
> Does that clarify things? 
> 
> -Mike
> 
> 
> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Hi Mike,
>> 
>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>> 
>> From your this statement "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see counters from each other, since JT maintains a unique counter for each tasks;
>> 
>> From your this comment "I meant that if a Task created and updated a counter, a different Task has access to that counter. " -- it seems different tasks could share/access the same counter.
>> 
>> Appreciate if you could help to clarify a bit.
>> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <mi...@hotmail.com> wrote:
>> 
>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>> 
>>> Hi Mike,
>>> 
>>> Thanks for the detailed reply. Two quick questions/comments,
>>> 
>>> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?
>> 
>> Either. 
>> 
>>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
>>> 
>> I meant that if a Task created and updated a counter, a different Task has access to that counter. 
>> 
>> To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
>> 
>>> regards,
>>> Lin
>>> 
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
>>> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
>>> The JT then will aggregate them. 
>>> 
>>> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
>>> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
>>> 
>>> In terms of global accessibility... Maybe.
>>> 
>>> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
>>> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
>>> 
>>> HTH
>>> Mike
>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>> 
>>>> Hi guys,
>>>> 
>>>> I have some quick questions regarding to Hadoop counter,
>>>> 
>>>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>>>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>>>> regards,
>>>> Lin
>>> 
>>> 
>> 
>> 
> 
> 


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Thanks for the detailed reply, Mike. I learned a lot from the discussion.

- I just want to confirm with you that, within the same job, when a
specific task has completed (and its counter is aggregated in the JT after
the task completes, per our discussion?), the other running tasks in the
same job cannot get the updated counter value from the completed task? I
am asking because I am wondering whether I can use a counter to share a
global value between tasks.
- If so, what is the traditional use case for counters, only using counter
values after the whole job completes? (A small sketch of the counter usage
I mean follows below.)
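
To make the question concrete, here is the kind of counter usage I mean;
the mapper class and the group and counter names are made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Dynamically named counter: each task keeps its own local copy of
    // ("MyApp", "RecordsSeen"); the JT aggregates the per-task copies, and
    // the total is dependable only once the whole job has completed.
    context.getCounter("MyApp", "RecordsSeen").increment(1);
    context.write(value, ONE);
  }
}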

BTW: I would appreciate it if you could share a few use cases from your
experience of how counters are used.

regards,
Lin

On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <mi...@hotmail.com>wrote:

> Yeah, sorry...
>
> I meant that if you were dynamically creating a counter foo in the Mapper
> task, then each mapper would be creating their own counter foo.
> As the job runs, these counters will eventually be sent up to the JT. The
> job tracker would keep a separate counter for each task.
>
> At the end, the final count is aggregated from the list of counters for
> foo.
>
>
> I don't know how you can get a task to ask information from the Job
> Tracker on how things are going in other tasks.  That is what I meant that
> you couldn't get information about the other counters or even the status of
> the other tasks running in the same job.
>
> I didn't see anything in the APIs that allowed for that type of flow... Of
> course having said that... someone pops up with a way to do just that. ;-)
>
>
> Does that clarify things?
>
> -Mike
>
>
> On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:
>
> Hi Mike,
>
> Sorry I am a bit lost... As you are thinking faster than me. :-P
>
> From your this statement "It would make sense that the JT maintains a
> unique counter for each task until the tasks complete." -- it seems each
> task cannot see counters from each other, since JT maintains a unique
> counter for each tasks;
>
> From your this comment "I meant that if a Task created and updated a
> counter, a different Task has access to that counter. " -- it seems
> different tasks could share/access the same counter.
>
> Appreciate if you could help to clarify a bit.
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <michael_segel@hotmail.com
> > wrote:
>
>>
>> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Thanks for the detailed reply. Two quick questions/comments,
>>
>> 1. For "task", you mean a specific mapper instance, or a specific reducer
>> instance?
>>
>>
>> Either.
>>
>> 2. "However, I do not believe that a separate Task could connect with the
>> JT and see if the counter exists or if it could get a value or even an
>> accurate value since the updates are asynchronous." -- do you mean if a
>> mapper is updating custom counter ABC, and another mapper is updating the
>> same customer counter ABC, their counter values are updated independently
>> by different mappers, and will not published (aggregated) externally until
>> job completed successfully?
>>
>> I meant that if a Task created and updated a counter, a different Task
>> has access to that counter.
>>
>> To give you an example, if I want to count the number of quality errors
>> and then fail after X number of errors, I can't use Global counters to do
>> this.
>>
>> regards,
>> Lin
>>
>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>> michael_segel@hotmail.com> wrote:
>>
>>> As I understand it... each Task has its own counters and are
>>> independently updated. As they report back to the JT, they update the
>>> counter(s)' status.
>>> The JT then will aggregate them.
>>>
>>> In terms of performance, Counters take up some memory in the JT so while
>>> its OK to use them, if you abuse them, you can run in to issues.
>>> As to limits... I guess that will depend on the amount of memory on the
>>> JT machine, the size of the cluster (Number of TT) and the number of
>>> counters.
>>>
>>> In terms of global accessibility... Maybe.
>>>
>>> The reason I say maybe is that I'm not sure by what you mean by globally
>>> accessible.
>>> If a task creates and implements a dynamic counter... I know that it
>>> will eventually be reflected in the JT. However, I do not believe that a
>>> separate Task could connect with the JT and see if the counter exists or if
>>> it could get a value or even an accurate value since the updates are
>>> asynchronous.  Not to mention that I don't believe that the counters are
>>> aggregated until the job ends. It would make sense that the JT maintains a
>>> unique counter for each task until the tasks complete. (If a task fails, it
>>> would have to delete the counters so that when the task is restarted the
>>> correct count is maintained. )  Note, I haven't looked at the source code
>>> so I am probably wrong.
>>>
>>> HTH
>>> Mike
>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> I have some quick questions regarding to Hadoop counter,
>>>
>>>
>>>    - Hadoop counter (customer defined) is global accessible (for both
>>>    read and write) for all Mappers and Reducers in a job?
>>>    - What is the performance and best practices of using Hadoop
>>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>>    performance downgrade to the whole job?
>>>
>>> regards,
>>> Lin
>>>
>>>
>>>
>>
>>
>
>
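
On Lin's question of whether counter values are only usable after the
whole job completes: that is the traditional pattern -- the driver
blocks on job completion and then reads the aggregated totals. A
minimal sketch of that driver-side read, assuming the Hadoop 1.x-era
org.apache.hadoop.mapreduce API (the job name and the counter
group/name are invented for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterReadingDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "counter-example");
        // ... set mapper, reducer, input and output paths here ...

        // Block until the job finishes; only at this point are the
        // counters final, i.e. fully aggregated across all tasks.
        if (job.waitForCompletion(true)) {
          Counters counters = job.getCounters();
          Counter foo = counters.findCounter("MyCounters", "foo");
          System.out.println("foo = " + foo.getValue());
        }
      }
    }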


Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
Yeah, sorry... 

I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating its own counter foo. 
As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task. 

At the end, the final count is aggregated from the list of counters for foo. 
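
(For illustration, a minimal sketch of the per-mapper dynamic counter
pattern described above, using the org.apache.hadoop.mapreduce API; the
counter group and names are invented for the example:)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FooCountingMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each task increments its own instance of the "foo" counter;
        // the framework ships the per-task values to the JT, which
        // aggregates them into the job-level total.
        context.getCounter("MyCounters", "foo").increment(1);
        context.write(value, new LongWritable(1L));
      }
    }

(Because every distinct group/name pair costs memory on the JT, a
common best practice is to predeclare counters with a Java enum and
call context.getCounter(SomeEnum.FOO) instead of generating names
dynamically.)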


I don't know how you can get a task to ask the Job Tracker for information on how things are going in other tasks. That is what I meant when I said you couldn't get information about the other counters, or even the status of the other tasks running in the same job. 

I didn't see anything in the APIs that allowed for that type of flow... Of course having said that... someone pops up with a way to do just that. ;-) 


Does that clarify things? 

-Mike


On Oct 19, 2012, at 11:56 AM, Lin Ma <li...@gmail.com> wrote:

> Hi Mike,
> 
> Sorry, I am a bit lost... you are thinking faster than me. :-P
> 
> From your statement "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see the others' counters, since the JT maintains a unique counter for each task;
> 
> From your comment "I meant that if a Task created and updated a counter, a different Task has access to that counter. " -- it seems different tasks could share/access the same counter.
> 
> I'd appreciate it if you could clarify a bit.
> 
> regards,
> Lin
> 
> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <mi...@hotmail.com> wrote:
> 
> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Hi Mike,
>> 
>> Thanks for the detailed reply. Two quick questions/comments,
>> 
>> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?
> 
> Either. 
> 
>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
>> 
> I meant that if a Task created and updated a counter, a different Task has access to that counter. 
> 
> To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
> 
>> regards,
>> Lin
>> 
>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
>> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
>> The JT then will aggregate them. 
>> 
>> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
>> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
>> 
>> In terms of global accessibility... Maybe.
>> 
>> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
>> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
>> 
>> HTH
>> Mike
>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>> 
>>> Hi guys,
>>> 
>>> I have some quick questions regarding to Hadoop counter,
>>> 
>>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>>> regards,
>>> Lin
>> 
>> 
> 
> 


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Hi Mike,

Sorry, I am a bit lost... you are thinking faster than me. :-P

From your statement "It would make sense that the JT maintains a
unique counter for each task until the tasks complete." -- it seems each
task cannot see the others' counters, since the JT maintains a unique
counter for each task;

From your comment "I meant that if a Task created and updated a
counter, a different Task has access to that counter. " -- it seems
different tasks could share/access the same counter.

I'd appreciate it if you could clarify a bit.

regards,
Lin

On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel
<mi...@hotmail.com> wrote:

>
> On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:
>
> Hi Mike,
>
> Thanks for the detailed reply. Two quick questions/comments,
>
> 1. For "task", you mean a specific mapper instance, or a specific reducer
> instance?
>
>
> Either.
>
> 2. "However, I do not believe that a separate Task could connect with the
> JT and see if the counter exists or if it could get a value or even an
> accurate value since the updates are asynchronous." -- do you mean if a
> mapper is updating custom counter ABC, and another mapper is updating the
> same custom counter ABC, their counter values are updated independently
> by different mappers, and will not be published (aggregated) externally
> until the job completes successfully?
>
> I meant that if a Task created and updated a counter, a different Task has
> access to that counter.
>
> To give you an example, if I want to count the number of quality errors
> and then fail after X number of errors, I can't use Global counters to do
> this.
>
> regards,
> Lin
>
> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <michael_segel@hotmail.com
> > wrote:
>
>> As I understand it... each Task has its own counters and are
>> independently updated. As they report back to the JT, they update the
>> counter(s)' status.
>> The JT then will aggregate them.
>>
>> In terms of performance, Counters take up some memory in the JT so while
>> its OK to use them, if you abuse them, you can run in to issues.
>> As to limits... I guess that will depend on the amount of memory on the
>> JT machine, the size of the cluster (Number of TT) and the number of
>> counters.
>>
>> In terms of global accessibility... Maybe.
>>
>> The reason I say maybe is that I'm not sure by what you mean by globally
>> accessible.
>> If a task creates and implements a dynamic counter... I know that it will
>> eventually be reflected in the JT. However, I do not believe that a
>> separate Task could connect with the JT and see if the counter exists or if
>> it could get a value or even an accurate value since the updates are
>> asynchronous.  Not to mention that I don't believe that the counters are
>> aggregated until the job ends. It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete. (If a task fails, it
>> would have to delete the counters so that when the task is restarted the
>> correct count is maintained. )  Note, I haven't looked at the source code
>> so I am probably wrong.
>>
>> HTH
>> Mike
>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> I have some quick questions regarding to Hadoop counter,
>>
>>
>>    - Hadoop counter (customer defined) is global accessible (for both
>>    read and write) for all Mappers and Reducers in a job?
>>    - What is the performance and best practices of using Hadoop
>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>    performance downgrade to the whole job?
>>
>> regards,
>> Lin
>>
>>
>>
>
>

Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:

> Hi Mike,
> 
> Thanks for the detailed reply. Two quick questions/comments,
> 
> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?

Either. 

> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
> 
I meant that if a Task created and updated a counter, a different Task has access to that counter. 

To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
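
A minimal sketch of the limitation being described, assuming the
org.apache.hadoop.mapreduce API (class name, counter names, and
threshold are all invented): the threshold can only be enforced against
the task's own local count, because a task cannot read the job-wide
aggregate while the job is still running.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class QualityCheckMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private static final long MAX_ERRORS = 100;  // per-task limit only
      private long localErrors = 0;

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.getLength() == 0) {  // placeholder quality check
          localErrors++;
          // Visible in the JT UI and aggregated when the job ends, but
          // NOT readable by the other tasks while the job runs.
          context.getCounter("Quality", "ERRORS").increment(1);
          if (localErrors > MAX_ERRORS) {
            throw new IOException("Too many quality errors in this task");
          }
          return;
        }
        context.write(value, new Text("ok"));
      }
    }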

> regards,
> Lin
> 
> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
> The JT then will aggregate them. 
> 
> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
> 
> In terms of global accessibility... Maybe.
> 
> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
> 
> HTH
> Mike
> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Hi guys,
>> 
>> I have some quick questions regarding to Hadoop counter,
>> 
>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>> regards,
>> Lin
> 
> 


Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
On Oct 19, 2012, at 11:27 AM, Lin Ma <li...@gmail.com> wrote:

> Hi Mike,
> 
> Thanks for the detailed reply. Two quick questions/comments,
> 
> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?

Either. 

> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and another mapper is updating the same customer counter ABC, their counter values are updated independently by different mappers, and will not published (aggregated) externally until job completed successfully?
> 
I meant that if a Task created and updated a counter, a different Task has access to that counter. 

To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.

> regards,
> Lin
> 
> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <mi...@hotmail.com> wrote:
> As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status.
> The JT then will aggregate them. 
> 
> In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. 
> As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. 
> 
> In terms of global accessibility... Maybe.
> 
> The reason I say maybe is that I'm not sure by what you mean by globally accessible. 
> If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous.  Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. )  Note, I haven't looked at the source code so I am probably wrong. 
> 
> HTH
> Mike
> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> 
>> Hi guys,
>> 
>> I have some quick questions regarding to Hadoop counter,
>> 
>> Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job?
>> What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job?
>> regards,
>> Lin
> 
> 


Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Hi Mike,

Thanks for the detailed reply. Two quick questions/comments,

1. For "task", you mean a specific mapper instance, or a specific reducer
instance?
2. "However, I do not believe that a separate Task could connect with the
JT and see if the counter exists or if it could get a value or even an
accurate value since the updates are asynchronous." -- do you mean if a
mapper is updating custom counter ABC, and another mapper is updating the
same customer counter ABC, their counter values are updated independently
by different mappers, and will not published (aggregated) externally until
job completed successfully?

regards,
Lin

On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel
<mi...@hotmail.com> wrote:

> As I understand it... each Task has its own counters and are independently
> updated. As they report back to the JT, they update the counter(s)' status.
> The JT then will aggregate them.
>
> In terms of performance, Counters take up some memory in the JT so while
> its OK to use them, if you abuse them, you can run in to issues.
> As to limits... I guess that will depend on the amount of memory on the JT
> machine, the size of the cluster (Number of TT) and the number of counters.
>
> In terms of global accessibility... Maybe.
>
> The reason I say maybe is that I'm not sure by what you mean by globally
> accessible.
> If a task creates and implements a dynamic counter... I know that it will
> eventually be reflected in the JT. However, I do not believe that a
> separate Task could connect with the JT and see if the counter exists or if
> it could get a value or even an accurate value since the updates are
> asynchronous.  Not to mention that I don't believe that the counters are
> aggregated until the job ends. It would make sense that the JT maintains a
> unique counter for each task until the tasks complete. (If a task fails, it
> would have to delete the counters so that when the task is restarted the
> correct count is maintained. )  Note, I haven't looked at the source code
> so I am probably wrong.
>
> HTH
> Mike
> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>
> Hi guys,
>
> I have some quick questions regarding to Hadoop counter,
>
>
>    - Hadoop counter (customer defined) is global accessible (for both
>    read and write) for all Mappers and Reducers in a job?
>    - What is the performance and best practices of using Hadoop counters?
>    I am not sure if using Hadoop counters too heavy, there will be performance
>    downgrade to the whole job?
>
> regards,
> Lin
>
>
>

Re: Hadoop counter

Posted by Harsh J <ha...@cloudera.com>.
Hi,

There are essentially 4 lists on a live JT: Running, Completed,
Failed/Killed, and Retired. The first 3 keep their counters in memory; the
last keeps them on disk. Completed and Failed/Killed jobs are moved to
Retired (persisted on disk, and garbage-collected out of memory) after a
default period of 24 hours past their finish time.
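
If it helps, here is a minimal sketch (old "mapred" API; the job ID and the
counter group/name are made up) of the lookup described below -- note that
getJob() returns null once the job has been retired out of the JT's memory:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class ReadJobCounters {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());

    // Job A's ID as printed by its driver; purely illustrative here.
    RunningJob jobA = client.getJob(JobID.forName("job_201210190001_0042"));
    if (jobA == null) {
      System.err.println("Job not found -- possibly retired out of JT memory");
      return;
    }

    Counters counters = jobA.getCounters();
    // Group and name must match what the tasks used; for an enum-based
    // counter the group is the enum's fully qualified class name.
    long errors = counters.findCounter("QualityMapper$Quality", "ERRORS").getCounter();
    System.out.println("Quality errors: " + errors);
  }
}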

On Fri, Oct 19, 2012 at 10:03 PM, Lin Ma <li...@gmail.com> wrote:
> Hi Harsh,
>
> Thanks for the brilliant reply.
>
> For your comments -- "Yes, they are ultimately stored at JT until the job is
> retired out of
>
> heap memory (in which case, they get stored into the JobHistory
> location and format).", does it mean only running job's counter will consume
> JT memory, for completed job, counter will be stored in disk (I think for
> "JobHistory location and format" is on disk?)?
>
> regards,
> Lin
>
>
> On Sat, Oct 20, 2012 at 12:19 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi,
>>
>> Inline.
>>
>> On Fri, Oct 19, 2012 at 9:39 PM, Lin Ma <li...@gmail.com> wrote:
>> > Hi Harsh,
>> >
>> > Thanks for the great reply. Two basic questions,
>> >
>> > - Where the counters' value are stored for successful job? On JT?
>>
>> Yes, they are ultimately stored at JT until the job is retired out of
>> heap memory (in which case, they get stored into the JobHistory
>> location and format).
>>
>> > - Supposing a specific job A completed successfully and updated related
>> > counters, is it possible for another specific job B to read counters
>> > updated
>> > by previous job A? If yes, how?
>>
>> Yes, possible, use the RunningJob object from the previous job (or
>> capture one) and query it. APIs you're interested in:
>>
>> Grab a query-able object (RunningJob and/or a Job):
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobClient.html#getJob(org.apache.hadoop.mapred.JobID)
>> or
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Cluster.html#getJob(org.apache.hadoop.mapreduce.JobID)
>>
>> Query counters:
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getCounters()
>> or
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
>>
>> > regards,
>> > Lin
>> >
>> >
>> > On Fri, Oct 19, 2012 at 11:50 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Bejoy is almost right, except that counters are reported upon progress
>> >> of tasks itself (via TT heartbeats to JT actually), but the final
>> >> counter representation is computed only with successful task reports
>> >> the job received, not from any failed or killed ones.
>> >>
>> >> On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com>
>> >> wrote:
>> >> > Hi Jay
>> >> >
>> >> > Counters are reported at the end of a task to JT. So if a task fails
>> >> > the
>> >> > counters from that task are not send to JT and hence won't be
>> >> > included
>> >> > in
>> >> > the final value of counters from that Job.
>> >> > Regards
>> >> > Bejoy KS
>> >> >
>> >> > Sent from handheld, please excuse typos.
>> >> > ________________________________
>> >> > From: Jay Vyas <ja...@gmail.com>
>> >> > Date: Fri, 19 Oct 2012 10:18:42 -0500
>> >> > To: <us...@hadoop.apache.org>
>> >> > ReplyTo: user@hadoop.apache.org
>> >> > Subject: Re: Hadoop counter
>> >> >
>> >> > Ah this answers alot about why some of my dynamic counters never show
>> >> > up
>> >> > and
>> >> > i have to bite my nails waiting to see whats going on until the end
>> >> > of
>> >> > the
>> >> > job- thanks.
>> >> >
>> >> > Another question: what happens if a task fails ?  What happen to the
>> >> > counters for it ?  Do they dissappear into the ether? Or do they get
>> >> > merged
>> >> > in with the counters from other tasks?
>> >> >
>> >> > On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux
>> >> > <de...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> And by default the number of counters is limited to 120 with the
>> >> >> mapreduce.job.counters.limit property.
>> >> >> They are useful for displaying short statistics about a job but
>> >> >> should
>> >> >> not
>> >> >> be used for results (imho).
>> >> >> I know people may misuse them but I haven't tried so I wouldn't be
>> >> >> able
>> >> >> to
>> >> >> list the caveats.
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >> Bertrand
>> >> >>
>> >> >>
>> >> >> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel
>> >> >> <mi...@hotmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> As I understand it... each Task has its own counters and are
>> >> >>> independently updated. As they report back to the JT, they update
>> >> >>> the
>> >> >>> counter(s)' status.
>> >> >>> The JT then will aggregate them.
>> >> >>>
>> >> >>> In terms of performance, Counters take up some memory in the JT so
>> >> >>> while
>> >> >>> its OK to use them, if you abuse them, you can run in to issues.
>> >> >>> As to limits... I guess that will depend on the amount of memory on
>> >> >>> the
>> >> >>> JT machine, the size of the cluster (Number of TT) and the number
>> >> >>> of
>> >> >>> counters.
>> >> >>>
>> >> >>> In terms of global accessibility... Maybe.
>> >> >>>
>> >> >>> The reason I say maybe is that I'm not sure by what you mean by
>> >> >>> globally
>> >> >>> accessible.
>> >> >>> If a task creates and implements a dynamic counter... I know that
>> >> >>> it
>> >> >>> will
>> >> >>> eventually be reflected in the JT. However, I do not believe that a
>> >> >>> separate
>> >> >>> Task could connect with the JT and see if the counter exists or if
>> >> >>> it
>> >> >>> could
>> >> >>> get a value or even an accurate value since the updates are
>> >> >>> asynchronous.
>> >> >>> Not to mention that I don't believe that the counters are
>> >> >>> aggregated
>> >> >>> until
>> >> >>> the job ends. It would make sense that the JT maintains a unique
>> >> >>> counter for
>> >> >>> each task until the tasks complete. (If a task fails, it would have
>> >> >>> to
>> >> >>> delete the counters so that when the task is restarted the correct
>> >> >>> count is
>> >> >>> maintained. )  Note, I haven't looked at the source code so I am
>> >> >>> probably
>> >> >>> wrong.
>> >> >>>
>> >> >>> HTH
>> >> >>> Mike
>> >> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>> >> >>>
>> >> >>> Hi guys,
>> >> >>>
>> >> >>> I have some quick questions regarding to Hadoop counter,
>> >> >>>
>> >> >>> Hadoop counter (customer defined) is global accessible (for both
>> >> >>> read
>> >> >>> and
>> >> >>> write) for all Mappers and Reducers in a job?
>> >> >>> What is the performance and best practices of using Hadoop
>> >> >>> counters? I
>> >> >>> am
>> >> >>> not sure if using Hadoop counters too heavy, there will be
>> >> >>> performance
>> >> >>> downgrade to the whole job?
>> >> >>>
>> >> >>> regards,
>> >> >>> Lin
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Bertrand Dechoux
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Jay Vyas
>> >> > http://jayunit100.blogspot.com
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Hi Harsh,

Thanks for the brilliant reply.

For your comments -- "Yes, they are ultimately stored at JT until the job
is retired out of
heap memory (in which case, they get stored into the JobHistory
location and format).", does it mean only running job's counter will
consume JT memory, for completed job, counter will be stored in disk (I
think for "JobHistory location and format" is on disk?)?

regards,
Lin

On Sat, Oct 20, 2012 at 12:19 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi,
>
> Inline.
>
> On Fri, Oct 19, 2012 at 9:39 PM, Lin Ma <li...@gmail.com> wrote:
> > Hi Harsh,
> >
> > Thanks for the great reply. Two basic questions,
> >
> > - Where the counters' value are stored for successful job? On JT?
>
> Yes, they are ultimately stored at JT until the job is retired out of
> heap memory (in which case, they get stored into the JobHistory
> location and format).
>
> > - Supposing a specific job A completed successfully and updated related
> > counters, is it possible for another specific job B to read counters
> updated
> > by previous job A? If yes, how?
>
> Yes, possible, use the RunningJob object from the previous job (or
> capture one) and query it. APIs you're interested in:
>
> Grab a query-able object (RunningJob and/or a Job):
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobClient.html#getJob(org.apache.hadoop.mapred.JobID)
> or
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Cluster.html#getJob(org.apache.hadoop.mapreduce.JobID)
>
> Query counters:
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getCounters()
> or
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
>
> > regards,
> > Lin
> >
> >
> > On Fri, Oct 19, 2012 at 11:50 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Bejoy is almost right, except that counters are reported upon progress
> >> of tasks itself (via TT heartbeats to JT actually), but the final
> >> counter representation is computed only with successful task reports
> >> the job received, not from any failed or killed ones.
> >>
> >> On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com>
> wrote:
> >> > Hi Jay
> >> >
> >> > Counters are reported at the end of a task to JT. So if a task fails
> the
> >> > counters from that task are not send to JT and hence won't be included
> >> > in
> >> > the final value of counters from that Job.
> >> > Regards
> >> > Bejoy KS
> >> >
> >> > Sent from handheld, please excuse typos.
> >> > ________________________________
> >> > From: Jay Vyas <ja...@gmail.com>
> >> > Date: Fri, 19 Oct 2012 10:18:42 -0500
> >> > To: <us...@hadoop.apache.org>
> >> > ReplyTo: user@hadoop.apache.org
> >> > Subject: Re: Hadoop counter
> >> >
> >> > Ah this answers alot about why some of my dynamic counters never show
> up
> >> > and
> >> > i have to bite my nails waiting to see whats going on until the end of
> >> > the
> >> > job- thanks.
> >> >
> >> > Another question: what happens if a task fails ?  What happen to the
> >> > counters for it ?  Do they dissappear into the ether? Or do they get
> >> > merged
> >> > in with the counters from other tasks?
> >> >
> >> > On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <dechouxb@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> And by default the number of counters is limited to 120 with the
> >> >> mapreduce.job.counters.limit property.
> >> >> They are useful for displaying short statistics about a job but
> should
> >> >> not
> >> >> be used for results (imho).
> >> >> I know people may misuse them but I haven't tried so I wouldn't be
> able
> >> >> to
> >> >> list the caveats.
> >> >>
> >> >> Regards
> >> >>
> >> >> Bertrand
> >> >>
> >> >>
> >> >> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel
> >> >> <mi...@hotmail.com>
> >> >> wrote:
> >> >>>
> >> >>> As I understand it... each Task has its own counters and are
> >> >>> independently updated. As they report back to the JT, they update
> the
> >> >>> counter(s)' status.
> >> >>> The JT then will aggregate them.
> >> >>>
> >> >>> In terms of performance, Counters take up some memory in the JT so
> >> >>> while
> >> >>> its OK to use them, if you abuse them, you can run in to issues.
> >> >>> As to limits... I guess that will depend on the amount of memory on
> >> >>> the
> >> >>> JT machine, the size of the cluster (Number of TT) and the number of
> >> >>> counters.
> >> >>>
> >> >>> In terms of global accessibility... Maybe.
> >> >>>
> >> >>> The reason I say maybe is that I'm not sure by what you mean by
> >> >>> globally
> >> >>> accessible.
> >> >>> If a task creates and implements a dynamic counter... I know that it
> >> >>> will
> >> >>> eventually be reflected in the JT. However, I do not believe that a
> >> >>> separate
> >> >>> Task could connect with the JT and see if the counter exists or if
> it
> >> >>> could
> >> >>> get a value or even an accurate value since the updates are
> >> >>> asynchronous.
> >> >>> Not to mention that I don't believe that the counters are aggregated
> >> >>> until
> >> >>> the job ends. It would make sense that the JT maintains a unique
> >> >>> counter for
> >> >>> each task until the tasks complete. (If a task fails, it would have
> to
> >> >>> delete the counters so that when the task is restarted the correct
> >> >>> count is
> >> >>> maintained. )  Note, I haven't looked at the source code so I am
> >> >>> probably
> >> >>> wrong.
> >> >>>
> >> >>> HTH
> >> >>> Mike
> >> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> >> >>>
> >> >>> Hi guys,
> >> >>>
> >> >>> I have some quick questions regarding to Hadoop counter,
> >> >>>
> >> >>> Hadoop counter (customer defined) is global accessible (for both
> read
> >> >>> and
> >> >>> write) for all Mappers and Reducers in a job?
> >> >>> What is the performance and best practices of using Hadoop
> counters? I
> >> >>> am
> >> >>> not sure if using Hadoop counters too heavy, there will be
> performance
> >> >>> downgrade to the whole job?
> >> >>>
> >> >>> regards,
> >> >>> Lin
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Bertrand Dechoux
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Jay Vyas
> >> > http://jayunit100.blogspot.com
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>

> >> >>> As to limits... I guess that will depend on the amount of memory on
> >> >>> the
> >> >>> JT machine, the size of the cluster (Number of TT) and the number of
> >> >>> counters.
> >> >>>
> >> >>> In terms of global accessibility... Maybe.
> >> >>>
> >> >>> The reason I say maybe is that I'm not sure by what you mean by
> >> >>> globally
> >> >>> accessible.
> >> >>> If a task creates and implements a dynamic counter... I know that it
> >> >>> will
> >> >>> eventually be reflected in the JT. However, I do not believe that a
> >> >>> separate
> >> >>> Task could connect with the JT and see if the counter exists or if
> it
> >> >>> could
> >> >>> get a value or even an accurate value since the updates are
> >> >>> asynchronous.
> >> >>> Not to mention that I don't believe that the counters are aggregated
> >> >>> until
> >> >>> the job ends. It would make sense that the JT maintains a unique
> >> >>> counter for
> >> >>> each task until the tasks complete. (If a task fails, it would have
> to
> >> >>> delete the counters so that when the task is restarted the correct
> >> >>> count is
> >> >>> maintained. )  Note, I haven't looked at the source code so I am
> >> >>> probably
> >> >>> wrong.
> >> >>>
> >> >>> HTH
> >> >>> Mike
> >> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> >> >>>
> >> >>> Hi guys,
> >> >>>
> >> >>> I have some quick questions regarding to Hadoop counter,
> >> >>>
> >> >>> Hadoop counter (customer defined) is global accessible (for both
> read
> >> >>> and
> >> >>> write) for all Mappers and Reducers in a job?
> >> >>> What is the performance and best practices of using Hadoop
> counters? I
> >> >>> am
> >> >>> not sure if using Hadoop counters too heavy, there will be
> performance
> >> >>> downgrade to the whole job?
> >> >>>
> >> >>> regards,
> >> >>> Lin
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Bertrand Dechoux
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Jay Vyas
> >> > http://jayunit100.blogspot.com
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>

Re: Hadoop counter

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Inline.

On Fri, Oct 19, 2012 at 9:39 PM, Lin Ma <li...@gmail.com> wrote:
> Hi Harsh,
>
> Thanks for the great reply. Two basic questions,
>
> - Where the counters' value are stored for successful job? On JT?

Yes, they are ultimately stored at JT until the job is retired out of
heap memory (in which case, they get stored into the JobHistory
location and format).

> - Supposing a specific job A completed successfully and updated related
> counters, is it possible for another specific job B to read counters updated
> by previous job A? If yes, how?

Yes, possible, use the RunningJob object from the previous job (or
capture one) and query it. APIs you're interested in:

Grab a query-able object (RunningJob and/or a Job):
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobClient.html#getJob(org.apache.hadoop.mapred.JobID)
or http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Cluster.html#getJob(org.apache.hadoop.mapreduce.JobID)

Query counters:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getCounters()
or http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
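
For illustration, a minimal sketch of such a "job B" reading a counter
published by a completed "job A", using the new-API Cluster/Job classes
linked above. This is a sketch under assumptions, not code from this
thread: the JobID argument and the "MyCounters"/"RECORDS_SEEN" group and
counter names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class ReadPreviousJobCounters {
  public static void main(String[] args) throws Exception {
    // args[0] is job A's ID string, e.g. "job_201210190001_0001".
    Cluster cluster = new Cluster(new Configuration());
    Job jobA = cluster.getJob(JobID.forName(args[0]));
    if (jobA == null) {
      // Job A may have been retired out of JT memory with no
      // JobHistory available to serve it anymore.
      System.err.println("Job " + args[0] + " not found");
      return;
    }
    Counters counters = jobA.getCounters();
    // Hypothetical custom counter group and name.
    Counter c = counters.findCounter("MyCounters", "RECORDS_SEEN");
    System.out.println(c.getName() + " = " + c.getValue());
  }
}

Whether job A is still queryable this way depends on the retirement and
JobHistory behaviour described above.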

> regards,
> Lin
>
>
> On Fri, Oct 19, 2012 at 11:50 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Bejoy is almost right, except that counters are reported upon progress
>> of tasks itself (via TT heartbeats to JT actually), but the final
>> counter representation is computed only with successful task reports
>> the job received, not from any failed or killed ones.
>>
>> On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com> wrote:
>> > Hi Jay
>> >
>> > Counters are reported at the end of a task to JT. So if a task fails the
>> > counters from that task are not send to JT and hence won't be included
>> > in
>> > the final value of counters from that Job.
>> > Regards
>> > Bejoy KS
>> >
>> > Sent from handheld, please excuse typos.
>> > ________________________________
>> > From: Jay Vyas <ja...@gmail.com>
>> > Date: Fri, 19 Oct 2012 10:18:42 -0500
>> > To: <us...@hadoop.apache.org>
>> > ReplyTo: user@hadoop.apache.org
>> > Subject: Re: Hadoop counter
>> >
>> > Ah this answers alot about why some of my dynamic counters never show up
>> > and
>> > i have to bite my nails waiting to see whats going on until the end of
>> > the
>> > job- thanks.
>> >
>> > Another question: what happens if a task fails ?  What happen to the
>> > counters for it ?  Do they dissappear into the ether? Or do they get
>> > merged
>> > in with the counters from other tasks?
>> >
>> > On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>
>> > wrote:
>> >>
>> >> And by default the number of counters is limited to 120 with the
>> >> mapreduce.job.counters.limit property.
>> >> They are useful for displaying short statistics about a job but should
>> >> not
>> >> be used for results (imho).
>> >> I know people may misuse them but I haven't tried so I wouldn't be able
>> >> to
>> >> list the caveats.
>> >>
>> >> Regards
>> >>
>> >> Bertrand
>> >>
>> >>
>> >> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel
>> >> <mi...@hotmail.com>
>> >> wrote:
>> >>>
>> >>> As I understand it... each Task has its own counters and are
>> >>> independently updated. As they report back to the JT, they update the
>> >>> counter(s)' status.
>> >>> The JT then will aggregate them.
>> >>>
>> >>> In terms of performance, Counters take up some memory in the JT so
>> >>> while
>> >>> its OK to use them, if you abuse them, you can run in to issues.
>> >>> As to limits... I guess that will depend on the amount of memory on
>> >>> the
>> >>> JT machine, the size of the cluster (Number of TT) and the number of
>> >>> counters.
>> >>>
>> >>> In terms of global accessibility... Maybe.
>> >>>
>> >>> The reason I say maybe is that I'm not sure by what you mean by
>> >>> globally
>> >>> accessible.
>> >>> If a task creates and implements a dynamic counter... I know that it
>> >>> will
>> >>> eventually be reflected in the JT. However, I do not believe that a
>> >>> separate
>> >>> Task could connect with the JT and see if the counter exists or if it
>> >>> could
>> >>> get a value or even an accurate value since the updates are
>> >>> asynchronous.
>> >>> Not to mention that I don't believe that the counters are aggregated
>> >>> until
>> >>> the job ends. It would make sense that the JT maintains a unique
>> >>> counter for
>> >>> each task until the tasks complete. (If a task fails, it would have to
>> >>> delete the counters so that when the task is restarted the correct
>> >>> count is
>> >>> maintained. )  Note, I haven't looked at the source code so I am
>> >>> probably
>> >>> wrong.
>> >>>
>> >>> HTH
>> >>> Mike
>> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I have some quick questions regarding to Hadoop counter,
>> >>>
>> >>> Hadoop counter (customer defined) is global accessible (for both read
>> >>> and
>> >>> write) for all Mappers and Reducers in a job?
>> >>> What is the performance and best practices of using Hadoop counters? I
>> >>> am
>> >>> not sure if using Hadoop counters too heavy, there will be performance
>> >>> downgrade to the whole job?
>> >>>
>> >>> regards,
>> >>> Lin
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Bertrand Dechoux
>> >
>> >
>> >
>> >
>> > --
>> > Jay Vyas
>> > http://jayunit100.blogspot.com
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Hi Harsh,

Thanks for the great reply. Two basic questions:

- Where are the counters' values stored for a successful job? On the JT?
- Supposing a specific job A completed successfully and updated related
counters, is it possible for another job B to read the counters updated by
job A? If yes, how?

regards,
Lin

On Fri, Oct 19, 2012 at 11:50 PM, Harsh J <ha...@cloudera.com> wrote:

> Bejoy is almost right, except that counters are reported upon progress
> of tasks itself (via TT heartbeats to JT actually), but the final
> counter representation is computed only with successful task reports
> the job received, not from any failed or killed ones.
>
> On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com> wrote:
> > Hi Jay
> >
> > Counters are reported at the end of a task to JT. So if a task fails the
> > counters from that task are not send to JT and hence won't be included in
> > the final value of counters from that Job.
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: Jay Vyas <ja...@gmail.com>
> > Date: Fri, 19 Oct 2012 10:18:42 -0500
> > To: <us...@hadoop.apache.org>
> > ReplyTo: user@hadoop.apache.org
> > Subject: Re: Hadoop counter
> >
> > Ah this answers alot about why some of my dynamic counters never show up
> and
> > i have to bite my nails waiting to see whats going on until the end of
> the
> > job- thanks.
> >
> > Another question: what happens if a task fails ?  What happen to the
> > counters for it ?  Do they dissappear into the ether? Or do they get
> merged
> > in with the counters from other tasks?
> >
> > On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>
> > wrote:
> >>
> >> And by default the number of counters is limited to 120 with the
> >> mapreduce.job.counters.limit property.
> >> They are useful for displaying short statistics about a job but should
> not
> >> be used for results (imho).
> >> I know people may misuse them but I haven't tried so I wouldn't be able
> to
> >> list the caveats.
> >>
> >> Regards
> >>
> >> Bertrand
> >>
> >>
> >> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <
> michael_segel@hotmail.com>
> >> wrote:
> >>>
> >>> As I understand it... each Task has its own counters and are
> >>> independently updated. As they report back to the JT, they update the
> >>> counter(s)' status.
> >>> The JT then will aggregate them.
> >>>
> >>> In terms of performance, Counters take up some memory in the JT so
> while
> >>> its OK to use them, if you abuse them, you can run in to issues.
> >>> As to limits... I guess that will depend on the amount of memory on the
> >>> JT machine, the size of the cluster (Number of TT) and the number of
> >>> counters.
> >>>
> >>> In terms of global accessibility... Maybe.
> >>>
> >>> The reason I say maybe is that I'm not sure by what you mean by
> globally
> >>> accessible.
> >>> If a task creates and implements a dynamic counter... I know that it
> will
> >>> eventually be reflected in the JT. However, I do not believe that a
> separate
> >>> Task could connect with the JT and see if the counter exists or if it
> could
> >>> get a value or even an accurate value since the updates are
> asynchronous.
> >>> Not to mention that I don't believe that the counters are aggregated
> until
> >>> the job ends. It would make sense that the JT maintains a unique
> counter for
> >>> each task until the tasks complete. (If a task fails, it would have to
> >>> delete the counters so that when the task is restarted the correct
> count is
> >>> maintained. )  Note, I haven't looked at the source code so I am
> probably
> >>> wrong.
> >>>
> >>> HTH
> >>> Mike
> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I have some quick questions regarding to Hadoop counter,
> >>>
> >>> Hadoop counter (customer defined) is global accessible (for both read
> and
> >>> write) for all Mappers and Reducers in a job?
> >>> What is the performance and best practices of using Hadoop counters? I
> am
> >>> not sure if using Hadoop counters too heavy, there will be performance
> >>> downgrade to the whole job?
> >>>
> >>> regards,
> >>> Lin
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Bertrand Dechoux
> >
> >
> >
> >
> > --
> > Jay Vyas
> > http://jayunit100.blogspot.com
>
>
>
> --
> Harsh J
>

Re: Hadoop counter

Posted by Jay Vyas <ja...@gmail.com>.
That clarifies why counters can go down during a job. Very important
clarification, because being able to rely on such "ephemeral" counters is a
really important tool for real-time monitoring of failures. Thanks, guys.
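
To make that concrete, a minimal client-side sketch (illustrative only,
assuming the job was submitted with the new API) of polling a custom
counter while the job runs; the group and name strings are hypothetical:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class CounterMonitor {
  /** Polls a custom counter every 10 seconds until the job finishes. */
  public static void watch(Job job, String group, String name)
      throws Exception {
    while (!job.isComplete()) {
      Counter c = job.getCounters().findCounter(group, name);
      // A snapshot only: TT heartbeats update it asynchronously, and
      // the value can drop if a task attempt fails and its counts are
      // discarded.
      System.out.println(name + " so far: " + c.getValue());
      Thread.sleep(10000L);
    }
  }
}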

Re: Hadoop counter

Posted by Bejoy KS <be...@gmail.com>.
Thanks Harsh. Great learning from you as always. :)

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Harsh J <ha...@cloudera.com>
Date: Fri, 19 Oct 2012 21:20:07 
To: <us...@hadoop.apache.org>; <be...@gmail.com>
Reply-To: user@hadoop.apache.org
Subject: Re: Hadoop counter

Bejoy is almost right, except that counters are reported upon progress
of tasks itself (via TT heartbeats to JT actually), but the final
counter representation is computed only with successful task reports
the job received, not from any failed or killed ones.

On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com> wrote:
> Hi Jay
>
> Counters are reported at the end of a task to JT. So if a task fails the
> counters from that task are not send to JT and hence won't be included in
> the final value of counters from that Job.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: Jay Vyas <ja...@gmail.com>
> Date: Fri, 19 Oct 2012 10:18:42 -0500
> To: <us...@hadoop.apache.org>
> ReplyTo: user@hadoop.apache.org
> Subject: Re: Hadoop counter
>
> Ah this answers alot about why some of my dynamic counters never show up and
> i have to bite my nails waiting to see whats going on until the end of the
> job- thanks.
>
> Another question: what happens if a task fails ?  What happen to the
> counters for it ?  Do they dissappear into the ether? Or do they get merged
> in with the counters from other tasks?
>
> On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>
> wrote:
>>
>> And by default the number of counters is limited to 120 with the
>> mapreduce.job.counters.limit property.
>> They are useful for displaying short statistics about a job but should not
>> be used for results (imho).
>> I know people may misuse them but I haven't tried so I wouldn't be able to
>> list the caveats.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>
>> wrote:
>>>
>>> As I understand it... each Task has its own counters and are
>>> independently updated. As they report back to the JT, they update the
>>> counter(s)' status.
>>> The JT then will aggregate them.
>>>
>>> In terms of performance, Counters take up some memory in the JT so while
>>> its OK to use them, if you abuse them, you can run in to issues.
>>> As to limits... I guess that will depend on the amount of memory on the
>>> JT machine, the size of the cluster (Number of TT) and the number of
>>> counters.
>>>
>>> In terms of global accessibility... Maybe.
>>>
>>> The reason I say maybe is that I'm not sure by what you mean by globally
>>> accessible.
>>> If a task creates and implements a dynamic counter... I know that it will
>>> eventually be reflected in the JT. However, I do not believe that a separate
>>> Task could connect with the JT and see if the counter exists or if it could
>>> get a value or even an accurate value since the updates are asynchronous.
>>> Not to mention that I don't believe that the counters are aggregated until
>>> the job ends. It would make sense that the JT maintains a unique counter for
>>> each task until the tasks complete. (If a task fails, it would have to
>>> delete the counters so that when the task is restarted the correct count is
>>> maintained. )  Note, I haven't looked at the source code so I am probably
>>> wrong.
>>>
>>> HTH
>>> Mike
>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> I have some quick questions regarding to Hadoop counter,
>>>
>>> Hadoop counter (customer defined) is global accessible (for both read and
>>> write) for all Mappers and Reducers in a job?
>>> What is the performance and best practices of using Hadoop counters? I am
>>> not sure if using Hadoop counters too heavy, there will be performance
>>> downgrade to the whole job?
>>>
>>> regards,
>>> Lin
>>>
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com



-- 
Harsh J
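
To make the per-task mechanics discussed in this thread concrete, here is
a minimal sketch (illustrative, not code from any of the posts) of a
user-defined counter: each task increments its own copy, the values travel
to the JT in TT heartbeats, and only successful attempts contribute to the
final job total. The enum and the empty-line check are made up for the
example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Enum-based counters are grouped under the enum's class name.
  public enum MyCounters { RECORDS_SEEN, BAD_RECORDS }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.getCounter(MyCounters.RECORDS_SEEN).increment(1);
    if (value.getLength() == 0) {
      // Count, but do not emit, empty input lines.
      context.getCounter(MyCounters.BAD_RECORDS).increment(1);
      return;
    }
    context.write(value, new LongWritable(1));
  }
}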

>>> not sure if using Hadoop counters too heavy, there will be performance
>>> downgrade to the whole job?
>>>
>>> regards,
>>> Lin
>>>
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com



-- 
Harsh J

Re: Hadoop counter

Posted by Lin Ma <li...@gmail.com>.
Hi Harsh,

Thanks for the great reply. Two basic questions:

- Where are the counters' values stored for a successful job? On the JT?
- Supposing a specific job A completed successfully and updated related
counters, is it possible for another job B to read the counters
updated by the previous job A? If yes, how?

regards,
Lin

On Fri, Oct 19, 2012 at 11:50 PM, Harsh J <ha...@cloudera.com> wrote:

> Bejoy is almost right, except that counters are reported upon progress
> of tasks itself (via TT heartbeats to JT actually), but the final
> counter representation is computed only with successful task reports
> the job received, not from any failed or killed ones.
>
> On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com> wrote:
> > Hi Jay
> >
> > Counters are reported at the end of a task to JT. So if a task fails the
> > counters from that task are not send to JT and hence won't be included in
> > the final value of counters from that Job.
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: Jay Vyas <ja...@gmail.com>
> > Date: Fri, 19 Oct 2012 10:18:42 -0500
> > To: <us...@hadoop.apache.org>
> > ReplyTo: user@hadoop.apache.org
> > Subject: Re: Hadoop counter
> >
> > Ah this answers alot about why some of my dynamic counters never show up
> and
> > i have to bite my nails waiting to see whats going on until the end of
> the
> > job- thanks.
> >
> > Another question: what happens if a task fails ?  What happen to the
> > counters for it ?  Do they dissappear into the ether? Or do they get
> merged
> > in with the counters from other tasks?
> >
> > On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>
> > wrote:
> >>
> >> And by default the number of counters is limited to 120 with the
> >> mapreduce.job.counters.limit property.
> >> They are useful for displaying short statistics about a job but should
> not
> >> be used for results (imho).
> >> I know people may misuse them but I haven't tried so I wouldn't be able
> to
> >> list the caveats.
> >>
> >> Regards
> >>
> >> Bertrand
> >>
> >>
> >> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <
> michael_segel@hotmail.com>
> >> wrote:
> >>>
> >>> As I understand it... each Task has its own counters and are
> >>> independently updated. As they report back to the JT, they update the
> >>> counter(s)' status.
> >>> The JT then will aggregate them.
> >>>
> >>> In terms of performance, Counters take up some memory in the JT so
> while
> >>> its OK to use them, if you abuse them, you can run in to issues.
> >>> As to limits... I guess that will depend on the amount of memory on the
> >>> JT machine, the size of the cluster (Number of TT) and the number of
> >>> counters.
> >>>
> >>> In terms of global accessibility... Maybe.
> >>>
> >>> The reason I say maybe is that I'm not sure by what you mean by
> globally
> >>> accessible.
> >>> If a task creates and implements a dynamic counter... I know that it
> will
> >>> eventually be reflected in the JT. However, I do not believe that a
> separate
> >>> Task could connect with the JT and see if the counter exists or if it
> could
> >>> get a value or even an accurate value since the updates are
> asynchronous.
> >>> Not to mention that I don't believe that the counters are aggregated
> until
> >>> the job ends. It would make sense that the JT maintains a unique
> counter for
> >>> each task until the tasks complete. (If a task fails, it would have to
> >>> delete the counters so that when the task is restarted the correct
> count is
> >>> maintained. )  Note, I haven't looked at the source code so I am
> probably
> >>> wrong.
> >>>
> >>> HTH
> >>> Mike
> >>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I have some quick questions regarding to Hadoop counter,
> >>>
> >>> Hadoop counter (customer defined) is global accessible (for both read
> and
> >>> write) for all Mappers and Reducers in a job?
> >>> What is the performance and best practices of using Hadoop counters? I
> am
> >>> not sure if using Hadoop counters too heavy, there will be performance
> >>> downgrade to the whole job?
> >>>
> >>> regards,
> >>> Lin
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Bertrand Dechoux
> >
> >
> >
> >
> > --
> > Jay Vyas
> > http://jayunit100.blogspot.com
>
>
>
> --
> Harsh J
>
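
For the second question, a minimal, untested sketch, assuming job A's
JobID is known and the old org.apache.hadoop.mapred API (JT/TT era) is
in use: any client, e.g. job B's driver, can ask the JT for a finished
job's counters for as long as the JT still remembers that job. The job
ID and counter names below are made up.

    // Sketch only: fetch a completed job's counters from the JT.
    JobConf conf = new JobConf();
    JobClient client = new JobClient(conf);
    RunningJob jobA = client.getJob(JobID.forName("job_201210190001_0001"));
    if (jobA != null && jobA.isComplete() && jobA.isSuccessful()) {
        Counters counters = jobA.getCounters();
        long n = counters.findCounter("MyGroup", "MY_COUNTER").getValue();
        // e.g. pass n to job B through its Configuration
    }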

Re: Hadoop counter

Posted by Harsh J <ha...@cloudera.com>.
Bejoy is almost right, except that counters are reported as tasks make
progress (via TT heartbeats to the JT, actually), but the final
counter representation is computed only from the successful task reports
the job received, not from any failed or killed ones.

On Fri, Oct 19, 2012 at 8:51 PM, Bejoy KS <be...@gmail.com> wrote:
> Hi Jay
>
> Counters are reported at the end of a task to JT. So if a task fails the
> counters from that task are not send to JT and hence won't be included in
> the final value of counters from that Job.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: Jay Vyas <ja...@gmail.com>
> Date: Fri, 19 Oct 2012 10:18:42 -0500
> To: <us...@hadoop.apache.org>
> ReplyTo: user@hadoop.apache.org
> Subject: Re: Hadoop counter
>
> Ah this answers alot about why some of my dynamic counters never show up and
> i have to bite my nails waiting to see whats going on until the end of the
> job- thanks.
>
> Another question: what happens if a task fails ?  What happen to the
> counters for it ?  Do they dissappear into the ether? Or do they get merged
> in with the counters from other tasks?
>
> On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>
> wrote:
>>
>> And by default the number of counters is limited to 120 with the
>> mapreduce.job.counters.limit property.
>> They are useful for displaying short statistics about a job but should not
>> be used for results (imho).
>> I know people may misuse them but I haven't tried so I wouldn't be able to
>> list the caveats.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>
>> wrote:
>>>
>>> As I understand it... each Task has its own counters and are
>>> independently updated. As they report back to the JT, they update the
>>> counter(s)' status.
>>> The JT then will aggregate them.
>>>
>>> In terms of performance, Counters take up some memory in the JT so while
>>> its OK to use them, if you abuse them, you can run in to issues.
>>> As to limits... I guess that will depend on the amount of memory on the
>>> JT machine, the size of the cluster (Number of TT) and the number of
>>> counters.
>>>
>>> In terms of global accessibility... Maybe.
>>>
>>> The reason I say maybe is that I'm not sure by what you mean by globally
>>> accessible.
>>> If a task creates and implements a dynamic counter... I know that it will
>>> eventually be reflected in the JT. However, I do not believe that a separate
>>> Task could connect with the JT and see if the counter exists or if it could
>>> get a value or even an accurate value since the updates are asynchronous.
>>> Not to mention that I don't believe that the counters are aggregated until
>>> the job ends. It would make sense that the JT maintains a unique counter for
>>> each task until the tasks complete. (If a task fails, it would have to
>>> delete the counters so that when the task is restarted the correct count is
>>> maintained. )  Note, I haven't looked at the source code so I am probably
>>> wrong.
>>>
>>> HTH
>>> Mike
>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> I have some quick questions regarding to Hadoop counter,
>>>
>>> Hadoop counter (customer defined) is global accessible (for both read and
>>> write) for all Mappers and Reducers in a job?
>>> What is the performance and best practices of using Hadoop counters? I am
>>> not sure if using Hadoop counters too heavy, there will be performance
>>> downgrade to the whole job?
>>>
>>> regards,
>>> Lin
>>>
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com



-- 
Harsh J
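
For reference, the producing side of that flow is just the normal
counter API inside a task. A sketch using the new
org.apache.hadoop.mapreduce API follows; the enum and the bad-record
check are made up:

    // Counter values accumulate locally in the task and are shipped to
    // the TT/JT with the regular progress heartbeats described above.
    public enum MyCounters { BAD_RECORDS }

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().isEmpty()) {
                context.getCounter(MyCounters.BAD_RECORDS).increment(1);
                return; // skip bad record
            }
            // ... normal map work ...
        }
    }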

Re: Hadoop counter

Posted by Bejoy KS <be...@gmail.com>.
Hi Jay

Counters are reported to the JT at the end of a task. So if a task fails, the counters from that task are not sent to the JT and hence won't be included in the final value of the counters for that job.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Jay Vyas <ja...@gmail.com>
Date: Fri, 19 Oct 2012 10:18:42 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: Hadoop counter

Ah this answers alot about why some of my dynamic counters never show up
and i have to bite my nails waiting to see whats going on until the end of
the job- thanks.

Another question: what happens if a task fails ?  What happen to the
counters for it ?  Do they dissappear into the ether? Or do they get merged
in with the counters from other tasks?

On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>wrote:

> And by default the number of counters is limited to 120 with the
> mapreduce.job.counters.limit property.
> They are useful for displaying short statistics about a job but should not
> be used for results (imho).
> I know people may misuse them but I haven't tried so I wouldn't be able to
> list the caveats.
>
> Regards
>
> Bertrand
>
>
> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> As I understand it... each Task has its own counters and are
>> independently updated. As they report back to the JT, they update the
>> counter(s)' status.
>> The JT then will aggregate them.
>>
>> In terms of performance, Counters take up some memory in the JT so while
>> its OK to use them, if you abuse them, you can run in to issues.
>> As to limits... I guess that will depend on the amount of memory on the
>> JT machine, the size of the cluster (Number of TT) and the number of
>> counters.
>>
>> In terms of global accessibility... Maybe.
>>
>> The reason I say maybe is that I'm not sure by what you mean by globally
>> accessible.
>> If a task creates and implements a dynamic counter... I know that it will
>> eventually be reflected in the JT. However, I do not believe that a
>> separate Task could connect with the JT and see if the counter exists or if
>> it could get a value or even an accurate value since the updates are
>> asynchronous.  Not to mention that I don't believe that the counters are
>> aggregated until the job ends. It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete. (If a task fails, it
>> would have to delete the counters so that when the task is restarted the
>> correct count is maintained. )  Note, I haven't looked at the source code
>> so I am probably wrong.
>>
>> HTH
>> Mike
>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> I have some quick questions regarding to Hadoop counter,
>>
>>
>>    - Hadoop counter (customer defined) is global accessible (for both
>>    read and write) for all Mappers and Reducers in a job?
>>    - What is the performance and best practices of using Hadoop
>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>    performance downgrade to the whole job?
>>
>> regards,
>> Lin
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>



-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: Hadoop counter

Posted by Jay Vyas <ja...@gmail.com>.
Ah, this answers a lot about why some of my dynamic counters never show up
and I have to bite my nails waiting to see what's going on until the end of
the job - thanks.

Another question: what happens if a task fails? What happens to the
counters for it? Do they disappear into the ether? Or do they get merged
in with the counters from other tasks?

On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux <de...@gmail.com>wrote:

> And by default the number of counters is limited to 120 with the
> mapreduce.job.counters.limit property.
> They are useful for displaying short statistics about a job but should not
> be used for results (imho).
> I know people may misuse them but I haven't tried so I wouldn't be able to
> list the caveats.
>
> Regards
>
> Bertrand
>
>
> On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> As I understand it... each Task has its own counters and are
>> independently updated. As they report back to the JT, they update the
>> counter(s)' status.
>> The JT then will aggregate them.
>>
>> In terms of performance, Counters take up some memory in the JT so while
>> its OK to use them, if you abuse them, you can run in to issues.
>> As to limits... I guess that will depend on the amount of memory on the
>> JT machine, the size of the cluster (Number of TT) and the number of
>> counters.
>>
>> In terms of global accessibility... Maybe.
>>
>> The reason I say maybe is that I'm not sure by what you mean by globally
>> accessible.
>> If a task creates and implements a dynamic counter... I know that it will
>> eventually be reflected in the JT. However, I do not believe that a
>> separate Task could connect with the JT and see if the counter exists or if
>> it could get a value or even an accurate value since the updates are
>> asynchronous.  Not to mention that I don't believe that the counters are
>> aggregated until the job ends. It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete. (If a task fails, it
>> would have to delete the counters so that when the task is restarted the
>> correct count is maintained. )  Note, I haven't looked at the source code
>> so I am probably wrong.
>>
>> HTH
>> Mike
>> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> I have some quick questions regarding to Hadoop counter,
>>
>>
>>    - Hadoop counter (customer defined) is global accessible (for both
>>    read and write) for all Mappers and Reducers in a job?
>>    - What is the performance and best practices of using Hadoop
>>    counters? I am not sure if using Hadoop counters too heavy, there will be
>>    performance downgrade to the whole job?
>>
>> regards,
>> Lin
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>



-- 
Jay Vyas
http://jayunit100.blogspot.com
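
On the "wait until the end" point: the aggregated values are typically
read back in the driver once the job completes. A sketch with the new
org.apache.hadoop.mapreduce API (MyCounters is the made-up enum from
the earlier sketch):

    // Sketch: read the job's aggregated counters after completion.
    Job job = new Job(new Configuration(), "counter-demo");
    // ... set mapper/reducer, input/output paths ...
    boolean ok = job.waitForCompletion(true);
    Counters counters = job.getCounters();
    long bad = counters.findCounter(MyCounters.BAD_RECORDS).getValue();
    System.out.println("Bad records: " + bad);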

Re: Hadoop counter

Posted by Bertrand Dechoux <de...@gmail.com>.
And by default the number of counters is limited to 120, via the
mapreduce.job.counters.limit property.
They are useful for displaying short statistics about a job but should not
be used for carrying actual results (imho).
I know people may misuse them that way, but I haven't tried it myself, so
I wouldn't be able to list the caveats.

Regards

Bertrand

On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>wrote:

> As I understand it... each Task has its own counters and are independently
> updated. As they report back to the JT, they update the counter(s)' status.
> The JT then will aggregate them.
>
> In terms of performance, Counters take up some memory in the JT so while
> its OK to use them, if you abuse them, you can run in to issues.
> As to limits... I guess that will depend on the amount of memory on the JT
> machine, the size of the cluster (Number of TT) and the number of counters.
>
> In terms of global accessibility... Maybe.
>
> The reason I say maybe is that I'm not sure by what you mean by globally
> accessible.
> If a task creates and implements a dynamic counter... I know that it will
> eventually be reflected in the JT. However, I do not believe that a
> separate Task could connect with the JT and see if the counter exists or if
> it could get a value or even an accurate value since the updates are
> asynchronous.  Not to mention that I don't believe that the counters are
> aggregated until the job ends. It would make sense that the JT maintains a
> unique counter for each task until the tasks complete. (If a task fails, it
> would have to delete the counters so that when the task is restarted the
> correct count is maintained. )  Note, I haven't looked at the source code
> so I am probably wrong.
>
> HTH
> Mike
> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>
> Hi guys,
>
> I have some quick questions regarding to Hadoop counter,
>
>
>    - Hadoop counter (customer defined) is global accessible (for both
>    read and write) for all Mappers and Reducers in a job?
>    - What is the performance and best practices of using Hadoop counters?
>    I am not sure if using Hadoop counters too heavy, there will be performance
>    downgrade to the whole job?
>
> regards,
> Lin
>
>
>


-- 
Bertrand Dechoux
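
If a job does hit that ceiling, the limit can be raised through the
property Bertrand mentions. A sketch for mapred-site.xml (the value 500
is just an example; in MR1 the JT reads this at startup, so a JT
restart is needed, as far as I understand):

    <property>
      <name>mapreduce.job.counters.limit</name>
      <value>500</value>
    </property>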

Re: Hadoop counter

Posted by Bertrand Dechoux <de...@gmail.com>.
And by default the number of counters is limited to 120 with the
mapreduce.job.counters.limit property.
They are useful for displaying short statistics about a job but should not
be used for results (imho).
I know people may misuse them but I haven't tried so I wouldn't be able to
list the caveats.

Regards

Bertrand

On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel <mi...@hotmail.com>wrote:

> As I understand it... each Task has its own counters and are independently
> updated. As they report back to the JT, they update the counter(s)' status.
> The JT then will aggregate them.
>
> In terms of performance, Counters take up some memory in the JT so while
> its OK to use them, if you abuse them, you can run in to issues.
> As to limits... I guess that will depend on the amount of memory on the JT
> machine, the size of the cluster (Number of TT) and the number of counters.
>
> In terms of global accessibility... Maybe.
>
> The reason I say maybe is that I'm not sure by what you mean by globally
> accessible.
> If a task creates and implements a dynamic counter... I know that it will
> eventually be reflected in the JT. However, I do not believe that a
> separate Task could connect with the JT and see if the counter exists or if
> it could get a value or even an accurate value since the updates are
> asynchronous.  Not to mention that I don't believe that the counters are
> aggregated until the job ends. It would make sense that the JT maintains a
> unique counter for each task until the tasks complete. (If a task fails, it
> would have to delete the counters so that when the task is restarted the
> correct count is maintained. )  Note, I haven't looked at the source code
> so I am probably wrong.
>
> HTH
> Mike
> On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:
>
> Hi guys,
>
> I have some quick questions regarding to Hadoop counter,
>
>
>    - Hadoop counter (customer defined) is global accessible (for both
>    read and write) for all Mappers and Reducers in a job?
>    - What is the performance and best practices of using Hadoop counters?
>    I am not sure if using Hadoop counters too heavy, there will be performance
>    downgrade to the whole job?
>
> regards,
> Lin
>
>
>


-- 
Bertrand Dechoux

Re: Hadoop counter

Posted by Michael Segel <mi...@hotmail.com>.
As I understand it... each task has its own counters, which are updated independently. As the tasks report back to the JT, they update the counters' status.
The JT then aggregates them.
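
To make that concrete, here is a minimal sketch of the write side (new mapreduce API; the class and counter names are made up for illustration): each task bumps its own local copy of the counter, and the framework ships the values back with the regular status reports.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: increments custom counters that the framework
// reports back to the JT, which aggregates them across all tasks.
public class LineCountMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Custom counters are typically declared as an enum.
  public enum MyCounters { MALFORMED_RECORDS, GOOD_RECORDS }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().isEmpty()) {
      context.getCounter(MyCounters.MALFORMED_RECORDS).increment(1);
      return;
    }
    context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
    context.write(value, new LongWritable(1));
  }
}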

In terms of performance, counters take up some memory in the JT, so while it's OK to use them, if you abuse them you can run into issues.
As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (number of TTs), and the number of counters.

In terms of global accessibility... Maybe.

The reason I say maybe is that I'm not sure what you mean by globally accessible.
If a task creates and increments a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate task could connect to the JT and see if the counter exists, or if it could get a value, or even an accurate value, since the updates are asynchronous. Not to mention that I don't believe the counters are aggregated until the job ends. It would make sense for the JT to maintain a separate counter for each task until the tasks complete. (If a task fails, the JT would have to delete its counters so that the correct count is maintained when the task is restarted.) Note, I haven't looked at the source code, so I may be wrong.
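
On the read side, the only pattern I would rely on is reading the aggregated values from the driver after the job has finished. A hedged sketch (reusing the made-up LineCountMapper from above; input and output paths are assumed to come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CounterDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "counter-demo");
    job.setJarByClass(CounterDriver.class);
    job.setMapperClass(LineCountMapper.class);
    job.setNumReduceTasks(0);   // map-only job for this sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Block until the job is done; only then are the aggregated
    // counter values final.
    job.waitForCompletion(true);

    Counter malformed = job.getCounters()
        .findCounter(LineCountMapper.MyCounters.MALFORMED_RECORDS);
    System.out.println("malformed records = " + malformed.getValue());
  }
}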

HTH
Mike
On Oct 19, 2012, at 5:50 AM, Lin Ma <li...@gmail.com> wrote:

> Hi guys,
> 
> I have some quick questions regarding Hadoop counters:
> 
> Are Hadoop counters (custom defined) globally accessible (for both read and write) by all Mappers and Reducers in a job?
> What are the performance implications and best practices of using Hadoop counters? I am not sure whether heavy use of counters will degrade the performance of the whole job.
> regards,
> Lin

