Posted to common-user@hadoop.apache.org by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com> on 2013/11/04 15:01:59 UTC
Running map reduce programmatically is unusually slow
Hi,
I have written a small utility to run a MapReduce job programmatically. My aim is to run my MapReduce job without using the hadoop shell script. I plan to call this utility from another application.
Following is the code that runs the MapReduce job. I have bundled this Java class into a jar (remotemr.jar). The actual MapReduce job is bundled inside another jar (mapreduce.jar).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class RemoteMapreduce {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        String inputPath = args[0];
        String outputPath = args[1];
        String specFilePath = args[2];

        Configuration config = new Configuration();
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/core-site.xml"));
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/hdfs-site.xml"));

        JobConf jobConf = new JobConf(config);
        // No trailing space in the property key, or Hadoop treats it as a different key.
        jobConf.set("hadoop.tmp.dir", "/tmp/hadoop-ananda/");
        jobConf.setJar("/home/ananda/mapreduce.jar");
        jobConf.setMapperClass(Myjob.MapClass.class);

        SequenceFileInputFormat.setInputPaths(jobConf, new Path(inputPath));
        TextOutputFormat.setOutputPath(jobConf, new Path(outputPath));
        jobConf.setMapOutputKeyClass(Text.class);
        jobConf.setMapOutputValueClass(Text.class);
        jobConf.setInputFormat(SequenceFileInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        jobConf.set("specPath", specFilePath);
        jobConf.setUser("ananda");

        Job job1 = new Job(jobConf);
        JobClient jc = new JobClient(jobConf);
        jc.submitJob(jobConf);
        /* JobControl ctrl = new JobControl("dar");
        ctrl.addJob(job1);
        ctrl.run(); */
        System.out.println("Job launched!");
    }
}
I am running it as follows:
java -cp <all hadoop jars needed for the job>:/home/ananda/mapreduce.jar:/home/ananda/remotemr.jar RemoteMapreduce <inputpath> <outputpath> <specpath>
It runs without any error, but it takes longer than when I run the job with the hadoop shell script. One more thing: all three input paths need to be fully qualified HDFS paths, i.e. hdfs://<hostname>:<port>/<path>. If I give relative paths, as I do with the hadoop shell script, I get input-path-not-found errors. Am I doing anything wrong? Please help. Thanks.
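[Editor's note, not part of the thread: the following is one possible explanation of both symptoms, offered as an assumption. In Hadoop 1.x, JobClient chooses between the in-process LocalJobRunner and a remote JobTracker based on the mapred.job.tracker property, which normally lives in mapred-site.xml — a file the code above never adds as a resource. A cluster's mapred-site.xml would typically contain a fragment like this, where jobtracker-host and 9001 are placeholders:]

```xml
<!-- Hypothetical mapred-site.xml fragment; jobtracker-host:9001 is a placeholder. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</configuration>
```

[If mapred.job.tracker is left at its default value of "local", the job runs inside the submitting JVM, which would account for both the slowness and the job never appearing in the JobTracker web UI.]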
Regards,
Anand.C
RE: Running map reduce programmatically is unusually slow
Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Hi,
This morning, I noticed one more weird thing. When I run the MapReduce job using this utility, it does not show up in the JobTracker web UI. Does anyone have any clue? Please help. Thanks.
Regards,
Anand.C
From: Chandra Mohan, Ananda Vel Murugan [mailto:Ananda.Murugan@honeywell.com]
Sent: Monday, November 04, 2013 7:32 PM
To: user@hadoop.apache.org
Subject: Running map reduce programmatically is unusually slow
[quoted original message trimmed]