Posted to common-user@hadoop.apache.org by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com> on 2013/11/04 15:01:59 UTC

Running map reduce programmatically is unusually slow

Hi,

I have written a small utility to run a map reduce job programmatically. My aim is to run my map reduce job without using the hadoop shell script. I am planning to call this utility from another application.

Following is the code which runs the map reduce job. I have bundled this java class into a jar (remotemr.jar). The actual map reduce job is bundled inside another jar (mapreduce.jar).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class RemoteMapreduce {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        String inputPath = args[0];
        String outputPath = args[1];
        String specFilePath = args[2];

        // Load the cluster's site files on top of the defaults.
        Configuration config = new Configuration();
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/core-site.xml"));
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/hdfs-site.xml"));

        JobConf jobConf = new JobConf(config);
        jobConf.set("hadoop.tmp.dir", "/tmp/hadoop-ananda/"); // key must have no trailing space
        jobConf.setJar("/home/ananda/mapreduce.jar");
        jobConf.setMapperClass(Myjob.MapClass.class);
        SequenceFileInputFormat.setInputPaths(jobConf, new Path(inputPath));
        TextOutputFormat.setOutputPath(jobConf, new Path(outputPath));
        jobConf.setMapOutputKeyClass(Text.class);
        jobConf.setMapOutputValueClass(Text.class);
        jobConf.setInputFormat(SequenceFileInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        jobConf.set("specPath", specFilePath);
        jobConf.setUser("ananda");

        Job job1 = new Job(jobConf);
        // submitJob() returns immediately without waiting for the job to finish.
        JobClient jc = new JobClient(jobConf);
        jc.submitJob(jobConf);
        /* JobControl ctrl = new JobControl("dar");
        ctrl.addJob(job1);
        ctrl.run(); */

        System.out.println("Job launched!");
    }
}
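A pitfall to watch for in code like the above: configuration keys are matched verbatim, so setting a key with a stray trailing space (e.g. `"hadoop.tmp.dir "`) silently creates a property distinct from `hadoop.tmp.dir`. A plain-Java sketch of the pitfall (a `HashMap` stands in here for Hadoop's `Configuration`; treating its lookup as exact-match is the assumption):

```java
import java.util.HashMap;
import java.util.Map;

public class ExactKeyDemo {
    public static void main(String[] args) {
        // The Map stands in for Hadoop's Configuration: lookups are exact-match.
        Map<String, String> conf = new HashMap<>();
        conf.put("hadoop.tmp.dir ", "/tmp/hadoop-ananda/"); // note the trailing space
        // Anything reading the key without the space finds nothing:
        System.out.println(conf.get("hadoop.tmp.dir"));  // prints "null"
        System.out.println(conf.get("hadoop.tmp.dir ")); // prints "/tmp/hadoop-ananda/"
    }
}
```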


I am running it as follows:

java -cp <all hadoop jars needed for the job>:/home/ananda/mapreduce.jar:/home/ananda/remotemr.jar RemoteMapreduce <inputpath> <outputpath> <specpath>

It runs without any error, but it takes longer than when I run the same job using the hadoop shell script. One more thing: all three input paths need to be fully qualified HDFS paths, i.e. hdfs://<hostname>:<port>/<path>. If I give relative paths, as I do with the hadoop shell script, I get "input path not found" errors. Am I doing anything wrong? Please help. Thanks.
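A plausible explanation worth checking (an assumption, not something confirmed in this thread): the code loads core-site.xml and hdfs-site.xml but never mapred-site.xml, and on Hadoop 1.x the site files normally live under conf/, not bin/. If mapred.job.tracker is left at its default of "local", the job runs in the single-threaded LocalJobRunner instead of on the cluster, and an unset fs.default.name would likewise force fully qualified hdfs:// paths. A typical mapred-site.xml entry (with a hypothetical host and port):

```xml
<!-- /opt/hadoop-1.0.2/conf/mapred-site.xml (typical location on Hadoop 1.x) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- hypothetical JobTracker address; use your cluster's host:port -->
    <value>jobtrackerhost:9001</value>
  </property>
</configuration>
```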

Regards,
Anand.C

RE: Running map reduce programmatically is unusually slow

Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Hi,

This morning I noticed one more weird thing: when I run the map reduce job using this utility, it does not show up in the JobTracker web UI. Does anyone have a clue? Please help. Thanks.
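One hypothesis that would tie both symptoms together (again an assumption, not confirmed in the thread): a job running in the LocalJobRunner never contacts the JobTracker at all, so it would not appear in the web UI. A small stdlib-only helper (hypothetical, not part of the original utility) to check whether a given *-site.xml actually defines mapred.job.tracker:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiteXmlCheck {
    // Returns the value of the named property in Hadoop *-site.xml content,
    // or null if the property is absent or the XML cannot be parsed.
    static String lookup(String siteXml, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(siteXml)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                if (name.equals(wanted)) {
                    return p.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String sample = "<configuration><property>"
                + "<name>mapred.job.tracker</name><value>jobtrackerhost:9001</value>"
                + "</property></configuration>";
        System.out.println(lookup(sample, "mapred.job.tracker")); // prints "jobtrackerhost:9001"
        System.out.println(lookup(sample, "fs.default.name"));    // prints "null"
    }
}
```

If this prints null for mapred.job.tracker on the file the utility actually loads, the job is very likely falling back to local mode.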

Regards,
Anand.C

From: Chandra Mohan, Ananda Vel Murugan [mailto:Ananda.Murugan@honeywell.com]
Sent: Monday, November 04, 2013 7:32 PM
To: user@hadoop.apache.org
Subject: Running map reduce programmatically is unusually slow

